I have a dataframe with 294,467 rows and 7 columns. I want to assign the same number to all products that have the same brand.
Here is the example of my dataframe:
overall ... brand
0 5.0 ... Pirmal Healthcare
1 5.0 ... Pirmal Healthcare
2 5.0 ... Pirmal Healthcare
3 5.0 ... Pirmal Healthcare
4 4.0 ... Pirmal Healthcare
... ... ...
294975 4.0 ... Gentlemen's Hardware
294976 5.0 ... Benefit Cosmetics
294977 1.0 ... Salon Perfect
294978 1.0 ... GBSTORE
294979 1.0 ... GBSTORE
[294467 rows x 7 columns]
Final result should be:
overall ... brand
0 5.0 ... 1
1 5.0 ... 1
2 5.0 ... 1
3 5.0 ... 1
4 4.0 ... 1
... ... ...
294975 4.0 ... 7839
294976 5.0 ... 7840
294977 1.0 ... 7841
294978 1.0 ... 7842
294979 1.0 ... 7842
[294467 rows x 7 columns]
To get this result, I sorted my dataframe by brand and then assigned a different number to each brand with this code:
sorted_copy = copy.sort_values('brand')
random_number = 0
first = ""
for f, row in sorted_copy.iterrows():
    i = row['brand']
    if first == i:
        sorted_copy.at[f, 'brand'] = random_number
    elif first != i:
        first = i
        random_number = random_number + 1
        sorted_copy.at[f, 'brand'] = random_number
However, this process took maybe an hour and a half. Is there any solution that gets this result in a shorter time? Can anyone help?
Thank you.
df['brand'] = df['brand'].astype("category").cat.codes
should work fine.
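Note that cat.codes numbers the (sorted) categories starting at 0. If you want the numbering to start at 1 as in the expected output, a minimal tweak would be (a sketch, assuming the brands should simply be numbered in sorted order):
# codes are 0-based over the sorted categories, so add 1 to match the 1-based output
df['brand'] = df['brand'].astype("category").cat.codes + 1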
If I have df1:
A B C D
0 4.51 6.212 3.12 1
1 3.12 3.444 1.12 1
2 6.98 7.413 7.02 0
3 4.51 8.916 5.12 1
....
n1 ~ 2000
and df2
A B C D
0 4.51 6.212 3.12 1
1 3.12 3.444 1.12 1
2 6.98 7.413 7.02 0
3 4.51 8.916 5.12 1
....
n2 = 10000+
And I have to perform an operation like:
df12 =
df1[0,A]-df2[0,A] df1[0,B]-df2[0,B] df1[0,C]-df2[0,C]....
df1[0,A]-df2[1,A] df1[0,B]-df2[1,B] df1[0,C]-df2[1,C]
...
df1[0,A]-df2[n2,A] df1[0,B]-df2[n2,B] df1[0,C]-df2[n2,C]
...
df1[1,A]-df2[0,A] df1[1,B]-df2[0,B] df1[1,C]-df2[0,C]....
df1[1,A]-df2[1,A] df1[1,B]-df2[1,B] df1[1,C]-df2[1,C]
...
df1[1,A]-df2[n2,A] df1[1,B]-df2[n2,B] df1[1,C]-df2[n2,C]
...
df1[n1,A]-df2[0,A] df1[n1,B]-df2[0,B] df1[n1,C]-df2[0,C]....
df1[n1,A]-df2[1,A] df1[n1,B]-df2[1,B] df1[n1,C]-df2[1,C]
...
df1[n1,A]-df2[n2,A] df1[n1,B]-df2[n2,B] df1[n1,C]-df2[n2,C]
Where every row in df1 is compared against every row in df2 producing a score.
What would be the best way to perform this operation using either pandas or vaex/equivalent?
Thanks in advance!
Broadcasting is the way to go:
pd.DataFrame((df1.to_numpy()[:, None] - df2.to_numpy()[None, ...]).reshape(-1, df1.shape[1]),
             columns=df2.columns,
             index=pd.MultiIndex.from_product((df1.index, df2.index))
             )
Output (for the first three rows of df1 and the first two rows of df2):
A B C D
0 0 0.00 0.000 0.0 0.0
1 1.39 2.768 2.0 0.0
1 0 -1.39 -2.768 -2.0 0.0
1 0.00 0.000 0.0 0.0
2 0 2.47 1.201 3.9 -1.0
1 3.86 3.969 5.9 -1.0
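For reference, a self-contained version of the above using the sample rows from the question (a sketch; the values are copied from the two snippets, and column D is included as given):
import pandas as pd

# first three rows of df1 and first two rows of df2 from the question
df1 = pd.DataFrame({'A': [4.51, 3.12, 6.98], 'B': [6.212, 3.444, 7.413],
                    'C': [3.12, 1.12, 7.02], 'D': [1, 1, 0]})
df2 = pd.DataFrame({'A': [4.51, 3.12], 'B': [6.212, 3.444],
                    'C': [3.12, 1.12], 'D': [1, 1]})

# broadcast: every row of df1 minus every row of df2, then flatten to a long frame
df12 = pd.DataFrame(
    (df1.to_numpy()[:, None] - df2.to_numpy()[None, ...]).reshape(-1, df1.shape[1]),
    columns=df2.columns,
    index=pd.MultiIndex.from_product((df1.index, df2.index)),
)
print(df12)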
I would use openpyxl. A loop like this would do:
for row in sheet.iter_rows(min_row=minr, min_col=start_col, max_col=finish_col, max_row=maxr):
    for cell in row:
        df1 = cell.value   # value of the current cell in the first range

for row in sheet.iter_rows(min_row=minr, min_col=start_col, max_col=finish_col, max_row=maxr):
    for cell in row:
        df2 = cell.value   # value of the current cell in the second range
From here, what do you want to do: create new values? Put them where? The code above points at them.
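For example, a sketch of reading two rectangular ranges into arrays and taking the all-pairs row differences (the file name, sheet name, and row/column ranges are placeholders, not from the question):
import numpy as np
from openpyxl import load_workbook

wb = load_workbook('data.xlsx')       # placeholder file name
sheet = wb['Sheet1']                  # placeholder sheet name

# read each range into a list of rows
block1 = [[cell.value for cell in row]
          for row in sheet.iter_rows(min_row=2, max_row=2001, min_col=1, max_col=4)]
block2 = [[cell.value for cell in row]
          for row in sheet.iter_rows(min_row=2, max_row=10001, min_col=1, max_col=4)]

# every row of block1 minus every row of block2
a1 = np.array(block1, dtype=float)
a2 = np.array(block2, dtype=float)
diff = (a1[:, None, :] - a2[None, :, :]).reshape(-1, a1.shape[1])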
An interesting question that vaex can actually solve quite memory-efficiently (although we should be able to require practically no memory in the future).
Let's start by creating the vaex dataframes, and increase the numbers a bit: 2,000 and 200,000 rows.
import vaex
import numpy as np
names = "ABCD"
N = 2000
M = N * 100
print(f'{N*M:,} rows')
400,000,000 rows
df1 = vaex.from_dict({name + '1': np.random.random(N) * 6 for name in names})
# We add a virtual range column for joining (requires no memory)
df1['i1'] = vaex.vrange(0, N, dtype=np.int32)
print(df1)
# A1 B1 C1 D1 i1
0 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 0
1 5.873731485927979 5.669031702051764 5.696571067838359 1.0310578585207142 1
2 4.513310303419997 4.466469647700519 5.047406986222205 3.4417402924374407 2
3 0.43402400660624174 1.157476656465433 2.179139262842482 1.1666706679131253 3
4 3.3698854360766526 2.203558794966768 0.39649910973621827 2.5576740079630502 4
... ... ... ... ... ...
1,995 4.836227485536714 4.093067389612236 5.992282902119859 1.3549691660861871 1995
1,996 1.1157617217838995 1.1606619796004967 3.2771620798090533 4.249631266421745 1996
1,997 4.628846984287445 4.019449674317169 3.7307713985954947 3.7702606362049362 1997
1,998 1.3196727531762933 2.6758762345410565 3.249315566523623 2.6501467681546123 1998
1,999 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 1999
df2 = vaex.from_dict({name + '2': np.random.random(M) * 6 for name in names})
df2['i2'] = vaex.vrange(0, M, dtype=np.int32)
print(df2)
# A2 B2 C2 D2 i2
0 2.6928366822161234 3.227321076730826 5.944154034728931 3.3584747680090814 0
1 4.824761575636117 2.960087600265437 3.492601702004836 1.054879207993813 1
2 4.33510613806528 0.46883404117516103 5.632361155412736 0.436374137912523 2
3 0.0422709543055384 2.5319705848478855 3.5596949266321216 2.5364151309685354 3
4 2.749335843510271 3.5446979145461146 2.550223710733076 5.02069361871291 4
... ... ... ... ... ...
199,995 5.32205669155252 4.321667991189379 2.1192950613326182 5.937425946574905 199995
199,996 0.10746705113978328 4.104809740632655 0.6282195590464632 3.9603843538752974 199996
199,997 5.74108180127652 3.5863223687990136 4.64031507831471 4.610807268734913 199997
199,998 5.839402924722246 2.630974123991404 4.50411700551054 3.0960758923309983 199998
199,999 1.6954091816701466 1.8054911765387567 4.300317113825266 4.900845720973579 199999
Now we create our 'master' vaex dataframe, which requires no memory at all; it's made of a virtual column and two expressions (stored as virtual columns):
df = vaex.from_arrays(i=vaex.vrange(0, N*M, dtype=np.int64))
df['i1'] = df.i // M # index to df1
df['i2'] = df.i % M # index to df2
print(df)
# i i1 i2
0 0 0 0
1 1 0 1
2 2 0 2
3 3 0 3
4 4 0 4
... ... ... ...
399,999,995 399999995 1999 199995
399,999,996 399999996 1999 199996
399,999,997 399999997 1999 199997
399,999,998 399999998 1999 199998
399,999,999 399999999 1999 199999
Unfortunately vaex cannot use these integer indices as lookups for joining directly; it has to go through a hashmap. So there is room for improvement here for vaex. If vaex could do this, we could scale this idea up to trillions of rows.
print(f"The next two joins require ~{len(df)*8*2//1024**2:,} MB of RAM")
The next two joins require ~6,103 MB of RAM
df_big = df.join(df1, on='i1')
df_big = df_big.join(df2, on='i2')
print(df_big)
# i i1 i2 A1 B1 C1 D1 A2 B2 C2 D2
0 0 0 0 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 2.6928366822161234 3.227321076730826 5.944154034728931 3.3584747680090814
1 1 0 1 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 4.824761575636117 2.960087600265437 3.492601702004836 1.054879207993813
2 2 0 2 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 4.33510613806528 0.46883404117516103 5.632361155412736 0.436374137912523
3 3 0 3 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 0.0422709543055384 2.5319705848478855 3.5596949266321216 2.5364151309685354
4 4 0 4 5.863720056164963 3.362898169138271 3.2880444660598784 0.1482153863632898 2.749335843510271 3.5446979145461146 2.550223710733076 5.02069361871291
... ... ... ... ... ... ... ... ... ... ... ...
399,999,995 399999995 1999 199995 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 5.32205669155252 4.321667991189379 2.1192950613326182 5.937425946574905
399,999,996 399999996 1999 199996 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 0.10746705113978328 4.104809740632655 0.6282195590464632 3.9603843538752974
399,999,997 399999997 1999 199997 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 5.74108180127652 3.5863223687990136 4.64031507831471 4.610807268734913
399,999,998 399999998 1999 199998 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 5.839402924722246 2.630974123991404 4.50411700551054 3.0960758923309983
399,999,999 399999999 1999 199999 5.013902261001965 0.07240435410582324 0.8744364486077243 2.6801177594876218 1.6954091816701466 1.8054911765387567 4.300317113825266 4.900845720973579
Now we have our big dataframe, and we only need to do the computation, which uses virtual columns and thus requires no extra memory.
# add virtual columns (which require no memory)
for name in names:
    df_big[name] = df_big[name + '1'] - df_big[name + '2']
print(df_big[['A', 'B', 'C', 'D']])
# A B C D
0 3.17088337394884 0.13557709240744487 -2.6561095686690526 -3.2102593816457916
1 1.038958480528846 0.40281056887283384 -0.20455723594495767 -0.9066638216305232
2 1.5286139180996834 2.8940641279631096 -2.344316689352858 -0.2881587515492332
3 5.821449101859425 0.8309275842903854 -0.2716504605722432 -2.3881997446052456
4 3.1143842126546923 -0.18179974540784372 0.7378207553268026 -4.87247823234962
... ... ... ... ...
399,999,995 -0.30815443055055525 -4.249263637083556 -1.244858612724894 -3.257308187087283
399,999,996 4.906435209862181 -4.032405386526832 0.2462168895612611 -1.2802665943876756
399,999,997 -0.7271795402745553 -3.51391801469319 -3.765878629706985 -1.930689509247291
399,999,998 -0.8255006637202813 -2.5585697698855805 -3.629680556902816 -0.41595813284337657
399,999,999 3.318493079331818 -1.7330868224329334 -3.4258806652175418 -2.220727961485957
If we had to do this all in memory, how much RAM would that have required?
print(f"This would otherwise require {len(df_big) * (4*3*8)//1023**2:,} MB of RAM")
This would otherwise require 36,692 MB of RAM
So, quite efficient I would say, and in the future it would be interesting to see if we can do the join more efficiently, and require practically zero RAM for this problem.
I have several tables imported from an Excel file:
df = pd.read_excel(ffile, 'Constraints', header = None, names = range(13))
table_names = ['A', ...., 'W']
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}
This is the first time I've tried to read multiple tables from a single sheet, so I'm not sure if this is the best approach. When printed like this:
for k, v in tables.items():
    print("table:", k)
    print(v)
    print()
The output is:
table: A
0 1 2 ... 10 11 12
2 Sxxxxxx Dxxx 21 20 ... 22 19 22
3 Rxxx Sxxxx / Lxxx Cxxxxxxxxxxx 7 7 ... 7 7 7
4 AVG Sxxxx per xxx # xx% Pxxxxxxxxxxxx 5 X 5.95 5.95 ... 5.95 5.95 5.95
...
...
...
table: W
0 1 2 ... 10 11 12
6 Sxxxxxx Dxxx 21 20 ... 22 19 22
7 Rxxx Sxxxx / Lxxx Cxxxxxxxxxxx 30 30 ... 30 30 30
8 AVG Sxxxx per xxx # xx% Pxxxxxxxxxxxx 5 x 28.5 28.5 ... 28.5 28.5 28.5
I tried to combine them all into one DataFrame using dfa = pd.DataFrame(tables['A'])
for each table, and then using fdf = pd.concat([dfa,...,dwf], keys =['A', ... 'W']).
The keys are hierarchically placed, but the autonumbered index column inserts itself after the keys and before the first column:
0 1 2 ... 10 11 12
A 2 Sxxxxxx Dxxx 21 20 ... 22 19 22
3 Rxxx Sxxxx / Lxxx Cxxxxxxxxxxx 7 7 ... 7 7 7
4 AVG Sxxxx per xxx # xx% Pxxxxxxxxxxxx 5 X 5.95 5.95 ... 5.95 5.95 5.95
I would like to convert the keys to an actual column and swap places with the pandas numbered index, but I'm not sure how to do that. I've tried .reset_index() in various configurations, but am wondering if I maybe constructed the tables wrong in the first place?
If any of this information is not necessary, please let me know and I will remove it. I'm trying to follow the MCVE guidelines and am not sure how much people need to know.
After you get your tables, just do
pd.concat(tables)
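If you then want the table key as a real column instead of the outer index level, a follow-up sketch (assuming the dict of tables from the question):
import pandas as pd

fdf = pd.concat(tables)                     # outer index level = table name
fdf = (fdf.rename_axis(['table', 'row'])    # name the two index levels
          .reset_index(level='table'))      # move the table key into a regular column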
I want to add up (and ideally get the mean of the sum of) several column values starting at my index i:
investmentlength = list(range(1, 13, 1))
returns = list()
for i in range(0, len(stocks2)):
    if stocks2['Startpoint'][i] == 1:
        nextmonth = nextmonth + stocks2['RET'][i+1] + stocks2['RET'][i+2] + stocks2['RET'][i+3] + ...
        counter += 1
Is there a way to give the beginning index, the end index, and probably a step size, and then sum it all in one command instead of copying and pasting to death? I want to go through all the different investment lengths and put the average returns in the empty list.
SHRCD EXCHCD SICCD PRC VOL RET SHROUT \
DATE PERMNO
1970-08-31 10559.0 10.0 1.0 5311.0 35.000 1692.0 0.030657 12048.0
12626.0 10.0 1.0 5411.0 46.250 926.0 0.088235 6624.0
12749.0 11.0 1.0 5331.0 45.500 5632.0 0.126173 34685.0
13100.0 11.0 1.0 5311.0 22.000 1759.0 0.171242 15107.0
13653.0 10.0 1.0 5311.0 13.125 141.0 0.220930 1337.0
13936.0 11.0 1.0 2331.0 11.500 270.0 -0.053061 3942.0
14322.0 11.0 1.0 5311.0 64.750 6934.0 0.024409 154187.0
16969.0 10.0 1.0 5311.0 42.875 1069.0 0.186851 13828.0
17072.0 10.0 1.0 5311.0 14.750 777.0 0.026087 5415.0
17304.0 10.0 1.0 5311.0 24.875 1939.0 0.058511 8150.0
MV XRET IB ... PE2 \
DATE PERMNO ...
1970-08-31 10559.0 421680.000 0.025357 NaN ... 13.852692
12626.0 306360.000 0.082935 NaN ... 13.145312
12749.0 1578167.500 0.120873 NaN ... 25.970466
13100.0 332354.000 0.165942 NaN ... 9.990711
13653.0 17548.125 0.215630 NaN ... 6.273570
13936.0 45333.000 -0.058361 NaN ... 6.473123
14322.0 9983608.250 0.019109 NaN ... 22.204047
16969.0 592875.500 0.181551 NaN ... 11.948061
17072.0 79871.250 0.020787 NaN ... 8.845526
17304.0 202731.250 0.053211 NaN ... 8.641655
lagPE1 lagPE2 lagMV lagSEQ QUINTILE1 \
DATE PERMNO
1970-08-31 10559.0 13.852692 13.852692 412644.000 264.686 4
12626.0 13.145312 13.145312 281520.000 164.151 4
12749.0 25.970466 25.970466 1404742.500 367.519 5
13100.0 9.990711 9.990711 288921.375 414.820 3
13653.0 6.273570 6.273570 14372.750 24.958 1
13936.0 6.473123 6.473123 48289.500 76.986 1
14322.0 22.204047 22.204047 9790874.500 3439.802 5
16969.0 11.948061 11.948061 499536.500 NaN 4
17072.0 8.845526 8.845526 77840.625 NaN 3
17304.0 8.641655 8.641655 191525.000 307.721 3
QUINTILE2 avgvol avg Startpoint
DATE PERMNO
1970-08-31 10559.0 4 9229.057592 1697.2 0
12626.0 4 3654.367470 894.4 0
12749.0 5 188206.566860 5828.6 0
13100.0 3 94127.319048 3477.2 0
13653.0 1 816.393162 268.8 0
13936.0 1 71547.050633 553.2 0
14322.0 5 195702.521519 6308.8 0
16969.0 4 3670.297872 2002.0 0
17072.0 3 3774.083333 3867.8 0
17304.0 3 12622.112903 1679.4 0
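A minimal sketch of how the window sums could be written with slices instead of spelling out every term (assuming the same positional row order the loop above relies on):
import numpy as np

ret = stocks2['RET'].to_numpy()
start_positions = np.flatnonzero(stocks2['Startpoint'].to_numpy() == 1)

returns = []
for length in investmentlength:                                    # 1 .. 12 months
    window_sums = [ret[i + 1:i + 1 + length].sum() for i in start_positions]
    returns.append(np.mean(window_sums))                           # average return per horizon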
I'm using Pandas for data analysis. I have an input file like this snippet:
VEH SEC POS ACCELL SPEED
2 8.4 36.51 -0.2929 27.39
3 8.4 23.57 -0.7381 33.09
4 8.4 6.18 0.6164 38.8
1 8.5 47.76 0 25.57
I need to reorganize the data so that the rows are the unique (ordered) values from SEC as the 1st column, and then the other columns would be VEH1_POS, VEH1_SPEED, VEH1_ACCELL, VEH2_POS, VEH2_SPEED, VEH2_ACCELL, etc.:
TIME VEH1_POS VEH1_SPEED VEH1_ACCEL VEH2_POS, VEH2_SPEED, etc.
0.1 6.2 3.7 0.0 7.5 2.1
0.2 6.8 3.2 -0.5 8.3 2.1
etc.
So, for example, the value for VEH1_POS for each row in the new dataframe would be filled in by selecting values from the POS column in the original dataframe using the row where the SEC value matches the TIME value for the row in the new dataframe and the VEH value == 1.
To set up the rows in the new data frame I'm doing this:
start = inputdf['SIMSEC'].min()
end = inputdf['SIMSEC'].max()
time_steps = frange(start, end, 0.1)
outputdf['TIME'] = time_steps
But I'm lost at how to select the right values from the input dataframe and create the rest of the new dataframe for further analysis. Note also that the input file will NOT have data for every VEH for every SEC (time stamp). So the solution needs to handle that as well. My best guess was:
outputdf['veh1_pos'] = np.where((inputdf['VEH NO'] == 1) & (inputdf['SIMSEC'] == row['Time Step']))
but that doesn't work.
import pandas as pd
# your data
# ==========================
print(df)
Out[272]:
VEH SEC POS ACCELL SPEED
0 2 8.4 36.51 -0.2929 27.39
1 3 8.4 23.57 -0.7381 33.09
2 4 8.4 6.18 0.6164 38.80
3 1 8.5 47.76 0.0000 25.57
# reshaping
# ==========================
result = df.set_index(['SEC','VEH']).unstack()
Out[278]:
POS ACCELL SPEED
VEH 1 2 3 4 1 2 3 4 1 2 3 4
SEC
8.4 NaN 36.51 23.57 6.18 NaN -0.2929 -0.7381 0.6164 NaN 27.39 33.09 38.8
8.5 47.76 NaN NaN NaN 0 NaN NaN NaN 25.57 NaN NaN NaN
So here, the columns have a multi-level index where the 1st level is POS, ACCELL, SPEED and the 2nd level is VEH=1,2,3,4.
# if you want to rename the column
temp_z = result.columns.get_level_values(0)
temp_y = result.columns.get_level_values(1)
temp_x = ['VEH'] * len(temp_y)
result.columns = ['{}{}_{}'.format(x,y,z) for x,y,z in zip(temp_x, temp_y, temp_z)]
Out[298]:
VEH1_POS VEH2_POS VEH3_POS VEH4_POS VEH1_ACCELL VEH2_ACCELL VEH3_ACCELL VEH4_ACCELL VEH1_SPEED VEH2_SPEED VEH3_SPEED VEH4_SPEED
SEC
8.4 NaN 36.51 23.57 6.18 NaN -0.2929 -0.7381 0.6164 NaN 27.39 33.09 38.8
8.5 47.76 NaN NaN NaN 0 NaN NaN NaN 25.57 NaN NaN NaN
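If you also want SEC back as a regular TIME column, plus rows for time steps with no data at all (as the question mentions), a possible follow-up sketch (assuming a 0.1-second step):
import numpy as np

time_steps = np.round(np.arange(df['SEC'].min(), df['SEC'].max() + 0.1, 0.1), 1)
result = (result.reindex(time_steps)      # insert NaN rows for missing time stamps
                .rename_axis('TIME')
                .reset_index())           # SEC index becomes a TIME column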