Why is pandas.join() not merging correctly along index? - python

I'm trying to merge two dataframes with identical indices into a single dataframe, but I can't seem to get it working. I expect the repeated values because of the resample function, and the final dataframe seems to have sorted the indices in ascending order, which is fine. But why is it now twice as long?
Here is the code:
Original dataframe:
default student balance income
0 No No 729.526495 44361.625074
1 No Yes 817.180407 12106.134700
2 No No 1073.549164 31767.138947
3 No No 529.250605 35704.493935
4 No No 785.655883 38463.495879
... ... ... ... ...
9995 No No 711.555020 52992.378914
9996 No No 757.962918 19660.721768
9997 No No 845.411989 58636.156984
9998 No No 1569.009053 36669.112365
9999 No Yes 200.922183 16862.952321
10000 rows × 4 columns
X = default[['balance','income']]
y = default['default']
boot = resample(X,y,replace=True,n_samples = len(X),random_state=1)
#convert to dataframe
boot = np.array(boot)
X = np.array(boot)[0]
y = np.array(boot)[1]
df = pd.DataFrame(X,index = X.index)
dfy = pd.DataFrame(y,index=y.index)
df = df.join(dfy)
X dataframe:
balance income
235 964.820253 34390.746035
5192 0.000000 29322.631394
905 1234.476479 31313.374575
7813 1598.020831 39163.361056
2895 1270.092810 16809.006452
... ... ...
7920 761.988491 39172.945235
1525 916.536937 20130.915258
4981 1037.573018 18769.579024
8104 912.065531 62142.061061
6990 1341.615739 26319.015588
[10000 rows x 2 columns]
Y dataframe:
default
235 No
5192 No
905 No
7813 Yes
2895 No
... ...
7920 No
1525 No
4981 No
8104 No
6990 No
[10000 rows x 1 columns]
Combined, for some reason, they give this:
balance income default
0 729.526495 44361.625074 No
0 729.526495 44361.625074 No
0 729.526495 44361.625074 No
0 729.526495 44361.625074 No
1 817.180407 12106.134700 No
... ... ... ...
9998 1569.009053 36669.112365 No
9999 200.922183 16862.952321 No
9999 200.922183 16862.952321 No
9999 200.922183 16862.952321 No
9999 200.922183 16862.952321 No
20334 rows × 3 columns
Can someone explain where I'm going wrong?
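For reference: join aligns rows by index label, and after resampling with replacement both frames carry the same duplicated labels, so every repeat of a label on the left matches every repeat on the right and the rows multiply (a label drawn twice contributes 2 × 2 = 4 rows, which is why the result exceeds 10000). A minimal sketch, assuming sklearn.utils.resample and the X and y from above, that keeps the pandas objects and aligns positionally instead:
import pandas as pd
from sklearn.utils import resample

# resample preserves pandas types, so X_boot and y_boot come back as a
# DataFrame and a Series carrying the same (duplicated) index labels.
X_boot, y_boot = resample(X, y, replace=True, n_samples=len(X), random_state=1)

# Align by position rather than by duplicated label: stays at 10000 rows.
df = X_boot.reset_index(drop=True)
df['default'] = y_boot.reset_index(drop=True)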

Related

How to get distinct rows from pandas dataframe?

I am having trouble getting distinct values from my dataframe. Below is the code I currently use; the issue is in the third line of vier(). I would like to show the top 10 fastest drivers based on their average heat (go-kart heat) time.
Input:
HeatNumber,NumberOfKarts,KartNumber,DriverName,Laptime
334,11,5,Monique,00:53.862
334,11,5,Monique,00:59.070
334,11,5,Monique,00:47.832
334,11,5,Monique,00:47.213
334,11,5,Monique,00:51.975
334,11,5,Monique,00:46.423
334,11,5,Monique,00:49.539
334,11,5,Monique,00:49.935
334,11,5,Monique,00:45.267
334,11,12,Robert-Jan,00:55.606
334,11,12,Robert-Jan,00:52.249
334,11,12,Robert-Jan,00:50.965
334,11,12,Robert-Jan,00:53.878
334,11,12,Robert-Jan,00:48.802
334,11,12,Robert-Jan,00:48.766
334,11,12,Robert-Jan,00:46.003
334,11,12,Robert-Jan,00:46.257
334,11,12,Robert-Jan,00:47.334
334,11,20,Katja,00:56.222
334,11,20,Katja,01:01.005
334,11,20,Katja,00:50.296
334,11,20,Katja,00:48.004
334,11,20,Katja,00:51.203
334,11,20,Katja,00:47.672
334,11,20,Katja,00:50.243
334,11,20,Katja,00:50.453
334,11,20,Katja,01:06.192
334,11,13,Bensu,00:56.332
334,11,13,Bensu,00:54.550
334,11,13,Bensu,00:52.023
334,11,13,Bensu,00:52.518
334,11,13,Bensu,00:50.738
334,11,13,Bensu,00:50.359
334,11,13,Bensu,00:49.307
334,11,13,Bensu,00:49.595
334,11,13,Bensu,00:50.504
334,11,17,Marit,00:56.740
334,11,17,Marit,00:52.534
334,11,17,Marit,00:48.331
334,11,17,Marit,00:56.204
334,11,17,Marit,00:49.066
334,11,17,Marit,00:49.210
334,11,17,Marit,00:45.655
334,11,17,Marit,00:46.261
334,11,17,Marit,00:46.837
334,11,11,Niels,00:58.518
334,11,11,Niels,01:01.562
334,11,11,Niels,00:51.238
334,11,11,Niels,00:48.808
Code:
import pandas as pd
import matplotlib.pyplot as plt

# Data
df = pd.read_csv('dataset_kartanalyser.csv')
df = df.dropna(axis=0, how='any')
# Split "MM:SS.mmm" laptimes into minutes and seconds, then convert to seconds
df = df.join(df['Laptime'].str.split(':', n=1, expand=True).rename(columns={0: 'M', 1: 'S'}))
df['M'] = df['M'].astype(int)
df['S'] = df['S'].astype(float)
df['Laptime'] = (df['M'] * 60) + df['S']
df.drop(['M', 'S'], axis=1, inplace=True)

# Functions
def twee():
    print("Het totaal aantal karts = " + str(df['KartNumber'].nunique()))
    print("Het aantal unique drivers = " + str(df['DriverName'].nunique()))
    print("Het totaal aantal heats = " + str(df['HeatNumber'].nunique()))

def drie():
    print("De 10 snelste Drivers obv individuele tijd zijn: ")
    print((df.groupby('DriverName')['Laptime'].nsmallest(1)).nsmallest(10))

def vier():
    print('De 10 snelste Drivers obv snelste heat gemiddelde:')
    print((df.groupby(['DriverName', 'HeatNumber'])['Laptime'].mean().round(3)).nsmallest(10))

print(df)
HeatNumber NumberOfKarts KartNumber DriverName Laptime
0 334 11 5 Monique 53.862
1 334 11 5 Monique 59.070
2 334 11 5 Monique 47.832
3 334 11 5 Monique 47.213
4 334 11 5 Monique 51.975
... ... ... ... ... ...
4053 437 2 20 luuk 39.678
4054 437 2 20 luuk 39.872
4055 437 2 20 luuk 39.454
4056 437 2 20 luuk 39.575
4057 437 2 20 luuk 39.648
Output:
DriverName HeatNumber
giovanni 411 26.233
ryan 411 27.747
giovanni 408 27.938
papa 394 28.075
guus 406 28.998
Rob 427 29.371
Suus 427 29.416
Jan-jullius 394 29.428
Joep 427 29.934
Indy 423 29.991
The output I get is almost correct, except that the driver "giovanni" occurs twice. I would like to show only the fastest average heat time for each driver. Does anyone know how to do this?
OK, so add drop_duplicates on a column like this; you just need to add a sort as well:
df.sort_values('B', ascending=True)
  .drop_duplicates('A', keep='first')
Applied here (reset_index turns the grouped Series into a frame you can sort and de-duplicate):
print(df.groupby(['DriverName', 'HeatNumber'])['Laptime'].mean().round(3)
        .reset_index()
        .sort_values('Laptime', ascending=True)
        .drop_duplicates('DriverName', keep='first')
        .nsmallest(10, 'Laptime'))
You group the data by DriverName and HeatNumber. Look at the HeatNumbers: one of giovanni's is 411 and another is 408, so pandas treats them as two distinct groups. Drop the duplicates per driver (or aggregate again by driver only) and each driver will appear once.
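A minimal alternative sketch of the same idea, with no de-duplication step: average per (driver, heat) first, then take each driver's best heat average by grouping on the DriverName index level. Note this drops the HeatNumber from the result; use the sort/drop_duplicates version above if you need it.
# Mean laptime per (driver, heat), then the fastest heat average per driver.
heat_means = df.groupby(['DriverName', 'HeatNumber'])['Laptime'].mean().round(3)
fastest = heat_means.groupby(level='DriverName').min()
print(fastest.nsmallest(10))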

Subdivide rows into intervals of 50 and take mean

I want to subdivide my dataframe into intervals of 50, then perform the mean of these individual intervals, and then create a new dataframe.
2 3 4
0 20.229517 21.444166 19.528986
1 19.420929 21.029457 18.895041
2 19.214857 21.122784 19.228065
3 19.454653 21.148373 19.249720
4 20.152334 22.183264 20.149488
... ... ... ...
9995 20.673738 22.252024 21.587578
9996 21.948563 24.904633 23.962317
9997 24.318361 27.220770 25.322933
9998 24.570177 26.371695 23.503048
9999 23.274368 25.145500 22.028172
10000 rows × 3 columns
The result should be a 200 x 3 dataframe containing the mean values. I would like to do this in pandas if possible. Thanks in advance!
You can use integer division of the index by 50 and pass it to groupby with an aggregate mean:
import numpy as np
import pandas as pd

np.random.seed(2020)
df = pd.DataFrame(np.random.random(size=(10000, 3)))
print (df)
0 1 2
0 0.986277 0.873392 0.509746
1 0.271836 0.336919 0.216954
2 0.276477 0.343316 0.862159
3 0.156700 0.140887 0.757080
4 0.736325 0.355663 0.341093
... ... ... ...
9995 0.524481 0.454835 0.108934
9996 0.816516 0.354442 0.224834
9997 0.090518 0.887463 0.444833
9998 0.413673 0.315459 0.691306
9999 0.656559 0.113400 0.063397
[10000 rows x 3 columns]
df = df.groupby(df.index // 50).mean()
print (df)
0 1 2
0 0.537299 0.484187 0.512674
1 0.446181 0.503547 0.455493
2 0.504955 0.446041 0.464571
3 0.567661 0.494185 0.519785
4 0.485611 0.553636 0.396364
.. ... ... ...
195 0.516285 0.433178 0.526545
196 0.476906 0.474619 0.465957
197 0.497325 0.511659 0.490382
198 0.564468 0.453961 0.467758
199 0.520884 0.455529 0.479706
[200 rows x 3 columns]

Python pandas, data binning a column by X size

When fetching data for an orderbook I get it in this format
Price Size
--------------------
0 8549.61 0.107015
1 8549.32 0.100000
2 8549.31 0.060000
3 8548.66 0.013950
4 8548.65 0.064791
... ... ...
995 8401.40 0.313921
996 8401.19 0.767512
997 8401.17 0.001721
998 8401.10 0.166487
999 8401.03 0.002235
1000 rows × 2 columns
Is there a way to bin the prices into $10 buckets, with the size being the sum over each bucket?
For example
Price Size
--------------------
0 8550 0.107015
1 8560 0.100000
2 8570 0.060000
3 8580 0.013950
I was looking at binning, but that gave me weird results. Thanks in advance!
You can use pandas string slicing to do this:
df['Price'] = df['Price'].astype(str)
# Determine the length of the integer part so the last digit can be dropped
len_str = len(str(int(float(df['Price'][0]))))
# Group by the price truncated to tens and sum the sizes in each bucket
df['binned'] = df.groupby(df.Price.str[0:len_str-1])['Size'].transform('sum')
df['column'] = df.Price.str[0:len_str-1] + '0'
df = df.drop_duplicates(subset=['column', 'binned'])[['column', 'binned']].reset_index(drop=True)
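For comparison, a shorter sketch that does the same $10 truncation numerically, using floor division instead of string slicing (Bucket and binned are illustrative names, not from the original):
import pandas as pd

# Hypothetical orderbook frame matching the question's layout.
df = pd.DataFrame({'Price': [8549.61, 8549.32, 8541.10, 8531.03],
                   'Size':  [0.107015, 0.100000, 0.166487, 0.002235]})

# Floor each price to its $10 bucket, then sum the sizes per bucket.
binned = (df.assign(Bucket=(df['Price'] // 10) * 10)
            .groupby('Bucket', as_index=False)['Size'].sum())
print(binned)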

How to create DataFrame from json data - dicts, lists and arrays within an array

I'm only able to get the headers, not the data, from my JSON data.
I have tried json_normalize, which creates a DataFrame from JSON data, but when I loop and append the data, the result is that I only get the headers.
import pandas as pd
import json
import requests
from pandas.io.json import json_normalize
import numpy as np
# importing json data
def get_json(file_path):
    r = requests.get('https://www.atg.se/services/racinginfo/v1/api/games/V75_2019-09-29_5_6')
    jsonResponse = r.json()
    with open(file_path, 'w', encoding='utf-8') as outfile:
        json.dump(jsonResponse, outfile, ensure_ascii=False, indent=None)

# Run the function and choose where to save the json file
get_json('../trav.json')

# Open the json file and print a list of the keys
with open('../trav.json', 'r') as json_data:
    d = json.load(json_data)
print(list(d.keys()))
[Out]:
['#type', 'id', 'status', 'pools', 'races', 'currentVersion']
To get all data for the starts in one race, I can use the json_normalize function:
race_1_starts = json_normalize(d['races'][0]['starts'])
race_1_starts_df = race_1_starts.drop('videos', axis=1)
print(race_1_starts_df)
[Out]:
distance driver.birth ... result.prizeMoney result.startNumber
0 1640 1984 ... 62500 1
1 1640 1976 ... 11000 2
2 1640 1968 ... 500 3
3 1640 1953 ... 250000 4
4 1640 1968 ... 500 5
5 1640 1962 ... 18500 6
6 1640 1961 ... 7000 7
7 1640 1989 ... 31500 8
8 1640 1960 ... 500 9
9 1640 1954 ... 500 10
10 1640 1977 ... 125000 11
11 1640 1977 ... 500 12
Above we get a DataFrame with data on all starts from one race. However, when I loop through all the races to collect data on all starts for every race, I only get the headers from each race, not the data on the starts:
all_starts = []
for t in range(len(d['races'])):
    all_starts.append([t+1, json_normalize(d['races'][t]['starts'])])
all_starts_df = pd.DataFrame(all_starts, columns=['race', 'starts'])
print(all_starts_df)
[Out]:
race starts
0 1 distance ... ...
1 2 distance ... ...
2 3 distance ... ...
3 4 distance ... ...
4 5 distance ... ...
5 6 distance ... ...
6 7 distance ... ...
In the output I want a DataFrame that merges the data on all starts from all races. Note that the number of columns can differ between races: if one race has 21 columns and another has 20, then all_starts_df should contain all of the columns, with 'NaN' wherever a race has no data for a column.
Expected result:
[Out]:
race distance driver.birth ... result.column_20 result.column_22
1 1640 1984 ... 12500 1
1 1640 1976 ... 11000 2
2 2140 1968 ... NaN 1
2 2140 1953 ... NaN 2
3 3360 1968 ... 1500 NaN
3 3360 1953 ... 250000 NaN
If you want all columns, you can try this (I find a lot more than 20 columns, so I might have something wrong):
all_starts = []
headers = []
for idx, race in enumerate(d['races']):
    df = json_normalize(race['starts'])
    df['race'] = idx
    all_starts.append(df.drop('videos', axis=1))
    headers.append(set(df.columns))
# Create set of all columns for all races
columns = set.union(*headers)
# If columns are missing from one dataframe add it (as np.nan)
for df in all_starts:
    for c in columns - set(df.columns):
        df[c] = np.nan
# Concatenate all dataframes for each race to make one dataframe
df_all_starts = pd.concat(all_starts, axis=0, sort=True)
Alternatively, if you know the names of the columns you want to keep, try this:
columns = ['race', 'distance', 'driver.birth', 'result.prizeMoney']
all_starts = []
for idx, race in enumerate(d['races']):
    df = json_normalize(race['starts'])
    df['race'] = idx
    all_starts.append(df[columns])
# Concatenate all dataframes for each race to make one dataframe
df_all_starts = pd.concat(all_starts, axis=0)
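As a side note, pd.concat with the default join='outer' already unions differing columns and fills the gaps with NaN, so a shorter sketch is possible. This assumes the same d as above and pandas 1.0+, where json_normalize lives in the top-level namespace; numbering races from 1 matches the expected output:
import pandas as pd

# One normalized frame per race, tagged with its race number (1-based);
# the 'videos' column could still be dropped as in the answer above.
frames = [pd.json_normalize(race['starts']).assign(race=idx + 1)
          for idx, race in enumerate(d['races'])]

# concat outer-joins the differing columns; missing entries become NaN.
df_all_starts = pd.concat(frames, axis=0, sort=True)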

How to compare a row value to all the rows in a different column and separate all the rows that match using Pandas

I have a csv file containing info on some banks. There are 9 columns in total. Two of the columns, id and parentid, contain the id of each bank and its parentid (parentid = 0 if the given bank is a parent, which is indicated by type = T).
I need to split the banks into separate dataframes such that all the children of a parent record end up in the same dataframe as the parent record.
Sample data:
type,symbol,price,quantity,expirydate,strikeprice,amendtime,id,parentid
T,ICICIBANK,1000,100,20121210,120,20121209103030,1234,0
T,AXISBANK,1000,100,20121210,120,20121209103031,1235,0
T,SBIBANK,1000,100,20121210,120,20121209103032,1236,0
P,ICICIBANK,1100,100,20121210,120,20121209103030,1237,1234
P,AXISBANK,1000,100,20121210,120,20121209103031,1238,1235
T,ICICIBANK,1000,100,20121210,120,20121209103035,1239,0
T,.CITIBANK,1000,101,20121210,120,20121209103036,1240,0
P,ICICIBANK,1100,100,20121210,120,20121209103030,1241,1234
P,ICICIBANK,1100,100,20121210,120,20121209103035,1242,1239
I have loaded the csv file using pandas and separated the children and parents based on the type column, but I am stuck with the remaining part; the sample data above shows what the dataframe looks like.
groupby can help here:
df.groupby(np.where(df.parentid==0, df.id, df.parentid))
will give you an iterable of tuples (id, dataframe_for_that_id_and_its_children).
Example:
for i, g in df.groupby(np.where(df.parentid==0, df.id, df.parentid)):
    print(i)
    print(g)
gives:
1234
type symbol price ... amendtime id parentid
0 T ICICIBANK 1000 ... 20121209103030 1234 0
3 P ICICIBANK 1100 ... 20121209103030 1237 1234
7 P ICICIBANK 1100 ... 20121209103030 1241 1234
[3 rows x 9 columns]
1235
type symbol price ... amendtime id parentid
1 T AXISBANK 1000 ... 20121209103031 1235 0
4 P AXISBANK 1000 ... 20121209103031 1238 1235
[2 rows x 9 columns]
1236
type symbol price ... amendtime id parentid
2 T SBIBANK 1000 ... 20121209103032 1236 0
[1 rows x 9 columns]
1239
type symbol price ... amendtime id parentid
5 T ICICIBANK 1000 ... 20121209103035 1239 0
8 P ICICIBANK 1100 ... 20121209103035 1242 1239
[2 rows x 9 columns]
1240
type symbol price ... amendtime id parentid
6 T .CITIBANK 1000 ... 20121209103036 1240 0
[1 rows x 9 columns]
This will split the dataframe into a dictionary of dataframes (the keys are the parent ids):
selection = df['parentid'].mask(df['parentid']==0, df['id'])
{sel: df.loc[selection == sel] for sel in selection.unique()}
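A quick usage sketch of that dictionary approach, assuming the sample data above was saved as banks.csv (a hypothetical filename):
import pandas as pd

df = pd.read_csv('banks.csv')  # the sample CSV from the question

# Replace parentid 0 with the bank's own id, then build one frame per family.
selection = df['parentid'].mask(df['parentid'] == 0, df['id'])
frames = {sel: df.loc[selection == sel] for sel in selection.unique()}

print(frames[1234])  # parent 1234 (ICICIBANK) together with its two children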
