I want to ask for some advice on speeding up my code. I know you will spot many mistakes, but I need your knowledge and help to find where the problem lies and how I can improve this code.
Background - what the application does:
1. Use OpenPyXL.
2. Open the Excel file, read the data and put it into a nested dictionary:
2a. 1st level for rows
2b. 2nd level for items
Example:
{1: '#5C\Qopen#', 2: '20386239', 3: '3000133215', 4: 'RA', 5: None, 6: 'Vendor2', 7: 'IM45', 8: '#FR\QNot due#', 9: None, 10: None, 11: 'E1', 12: 'DNS', 13: datetime.datetime(2019, 12, 27, 0, 0), 14: datetime.datetime(2019, 12, 26, 0, 0), 15: -21501, 16: 'GBP', 17: -21501, 18: 'GBP', 19: datetime.datetime(2019, 12, 26, 0, 0), 20: datetime.datetime(2020, 2, 9, 0, 0)}
{1: '#5C\Qopen#', 2: '20386239', 3: '3000133215', 4: 'RA', 5: None, 6: 'Vendor1', 7: 'IM45', 8: '#FR\QNot due#', 9: None, 10: None, 11: 'E1', 12: 'DNS', 13: datetime.datetime(2019, 12, 27, 0, 0), 14: datetime.datetime(2019, 12, 26, 0, 0), 15: -21501, 16: 'GBP', 17: -21501, 18: 'GBP', 19: datetime.datetime(2019, 12, 26, 0, 0), 20: datetime.datetime(2020, 2, 9, 0, 0)}
{1: '#5C\Qopen#', 2: '20386239', 3: '3000133215', 4: 'RA', 5: None, 6: 'Vendor1', 7: 'IM45', 8: '#FR\QNot due#', 9: None, 10: None, 11: 'E1', 12: 'DNS', 13: datetime.datetime(2019, 12, 27, 0, 0), 14: datetime.datetime(2019, 12, 26, 0, 0), 15: -21501, 16: 'EUR', 17: -21501, 18: 'GBP', 19: datetime.datetime(2019, 12, 26, 0, 0), 20: datetime.datetime(2020, 2, 9, 0, 0)}
The script should look through the whole data set, comparing Vendor and Currency, and check whether a given vendor appears with more than one currency (i.e. when, say, Vendor1 is not 100% in one specific currency such as GBP).
When that happens, put a text such as "something something" in, for example, column 17 of the same row where the different currency appears.
My code basically works correctly, but it is very slow - for example when I have to process a file with 30,000 rows.
Do you know how I can improve it?
Thank you
next_row2 = 1
numerkolumny = 1
nastepny = 1
numer_vendora = 1
ilosc_gbp = 0
ilosc_inne = 0
linijkadanych = {}
lista_vendorow = {}
for zmienna2 in progressbar.progressbar(assets2, redirect_stdout=True):
    for iteracja in assets2:
        if assets2[zmienna2][6] not in lista_vendorow.values():
            if nastepny < len(assets2):
                if assets2[zmienna2][6] == assets2[nastepny+1][6]:
                    if assets2[nastepny+1][16] == "GBP":  # IF GBP WAS FOUND, COUNT IT AS GBP
                        ilosc_gbp = ilosc_gbp + 1
                        nastepny = nastepny + 1
                    else:  # IF ANOTHER CURRENCY WAS FOUND, COUNT IT AS OTHER
                        ilosc_inne = ilosc_inne + 1
                        nastepny = nastepny + 1
                else:
                    nastepny = nastepny + 1
            if nastepny >= len(assets2):  # IF ALL ROWS HAVE BEEN ITERATED OVER, COMPUTE THE RESULT
                suma_walut = ilosc_gbp + ilosc_inne  # SUM UP ALL CURRENCIES
                # IF A DEVIATION IS FOUND - REPORT IT!
                if (suma_walut != ilosc_gbp) and (suma_walut != ilosc_inne):
                    for waluty in assets2:  # row number
                        for waluty2 in assets2[waluty]:  # column number
                            if assets2[waluty][6] == assets2[zmienna2][6]:
                                if ilosc_gbp > ilosc_inne:
                                    result_tab.cell(column=17, row=waluty+1, value="Waluta other than GBP. Check!").font = style_blad_bold
                                else:
                                    result_tab.cell(column=17, row=waluty+1, value="Other currencies between GBP!. Check!").font = style_blad_bold
                lista_vendorow[numer_vendora] = assets2[zmienna2][6]
                ilosc_gbp = 0  # RESET VARIABLES, WE ARE COUNTING A NEW VENDOR
                ilosc_inne = 0  # RESET VARIABLES, WE ARE COUNTING A NEW VENDOR
                nastepny = 1  # RESET VARIABLES, WE ARE COUNTING A NEW VENDOR
                numer_vendora = numer_vendora + 1
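For what it is worth, the nested loops above rescan every row for every vendor, so the run time grows roughly quadratically with the number of rows. A single-pass approach that first collects the set of currencies per vendor and then flags the mixed vendors should scale linearly. Below is a minimal sketch, assuming (as in the code above) that assets2 maps row numbers to per-row dictionaries with the vendor in key 6 and the currency in key 16, and that result_tab and style_blad_bold are the already-defined worksheet and font; the flag text is only a placeholder:

from collections import defaultdict

# Pass 1: collect the set of currencies seen for each vendor.
currencies_per_vendor = defaultdict(set)
for row_no, row in assets2.items():
    currencies_per_vendor[row[6]].add(row[16])

# Pass 2: flag every row whose vendor uses more than one currency.
for row_no, row in assets2.items():
    if len(currencies_per_vendor[row[6]]) > 1:
        result_tab.cell(column=17, row=row_no + 1,
                        value="Mixed currencies for this vendor. Check!").font = style_blad_bold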
I am trying to extend the first dataframe based on information in the second, without using itertuples() as it takes too long.
The first dataframe contains data on special events, let's say an avalanche going down in a ski resort. In the end I want to correlate the mass of the avalanche with the available status data from the city, such as temperature or amount of snow. All status data from all cities is in one DataFrame. All information on the avalanches is in the other DataFrame, which I would like to extend with the status data IF the status data was collected in the same city on the same day at an earlier time.
Here is the setup, I hope it is easy to understand/paste:
import pandas as pd
import numpy as np
import datetime
status = pd.DataFrame({'location': {0: 'Silvretta', 1: 'Landeck', 2: 'Silvretta', 3: 'Landeck', 4: 'Silvretta', 5: 'Landeck', 6: 'Silvretta', 7: 'Landeck'}, 'date': {0: datetime.date(2022, 3, 30), 1: datetime.date(2022, 3, 30), 2: datetime.date(2022, 3, 30), 3: datetime.date(2022, 3, 30), 4: datetime.date(2022, 3, 31), 5: datetime.date(2022, 3, 31), 6: datetime.date(2022, 3, 31), 7: datetime.date(2022, 3, 31)}, 'time': {0: datetime.time(8, 0), 1: datetime.time(8, 0), 2: datetime.time(20, 0), 3: datetime.time(20, 0), 4: datetime.time(8, 0), 5: datetime.time(8, 0), 6: datetime.time(20, 0), 7: datetime.time(20, 0)}, 'wind_lvl': {0: 8, 1: 5, 2: 10, 3: 7, 4: 8, 5: 10, 6: 2, 7: 1}, 'snow_lvl': {0: 10, 1: 11, 2: 7, 3: 9, 4: 4, 5: 0, 6: 4, 7: 4}, 'number_skiers': {0: 26, 1: 87, 2: 8, 3: 25, 4: 90, 5: 86, 6: 55, 7: 31}})
avalanches = pd.DataFrame({'location': {0: 'Silvretta', 1: 'Landeck', 2: 'Silvretta', 3: 'Silvretta'}, 'date': {0: datetime.date(2022, 3, 30), 1: datetime.date(2022, 3, 30), 2: datetime.date(2022, 3, 31), 3: datetime.date(2022, 3, 31)}, 'time': {0: datetime.time(7, 35), 1: datetime.time(12, 37), 2: datetime.time(12, 42), 3: datetime.time(23, 12)}, 'mass': {0: 10, 1: 15, 2: 8, 3: 7}})
Now, with itertuples, I can do the following non-Pythonic loop:
wind_lvls = np.full(len(avalanches), np.nan)
snow_lvls = np.full(len(avalanches), np.nan)
number_ski = np.full(len(avalanches), np.nan)
for idx, ava in enumerate(avalanches.itertuples()):
    relevant_status = status[(status.date == ava.date) & (status.location == ava.location) & (status.time < ava.time)]
    if len(relevant_status) > 0:
        wind_lvls[idx] = relevant_status.wind_lvl.iloc[-1]
        snow_lvls[idx] = relevant_status.snow_lvl.iloc[-1]
        number_ski[idx] = relevant_status.number_skiers.iloc[-1]
avalanches['wind_lvl'] = wind_lvls
avalanches['snow_lvls'] = snow_lvls
avalanches['number_ski'] = number_ski
This gives me the correct avalanches table, and I can call avalanches.dropna().corr()['mass'], which is the information I was after. However, the process takes far too long on a non-toy dataset.
To cut the loop and all the manual specification of numpy arrays I tried the following:
avalanches['info'] = avalanches.apply(lambda row: status[(status.location == row.location) & (status.date == row.date) & (status.time < row.time)], axis=1)
This at least gives me the relevant information in a cell of the avalanche df, but I still need to get the latest row of that info and unpack it. I cannot use avalanches['info'] = avalanches.apply(lambda row: status[(status.location == row.location) & (status.date == row.date) & (status.time < row.time)][-1], axis=1) because there might be no match at all (which should result in NaNs). Should I just pack the whole function into a lambda, and would that actually speed up the process?
Could you help me specify a function to apply that would be faster than my itertuples() approach? I feel like I am missing something.
Thanks in advance!
Edit: I updated the status DataFrame with an additional row, measured at 12:00, shown below:
status = pd.DataFrame({'location': {0: 'Silvretta', 1: 'Landeck', 2: 'Silvretta', 3: 'Landeck', 4: 'Silvretta', 5: 'Landeck', 6: 'Silvretta', 7: 'Silvretta', 8: 'Landeck'}, 'date': {0: datetime.date(2022, 3, 30), 1: datetime.date(2022, 3, 30), 2: datetime.date(2022, 3, 30), 3: datetime.date(2022, 3, 30), 4: datetime.date(2022, 3, 31), 5: datetime.date(2022, 3, 31), 6: datetime.date(2022, 3, 31), 7: datetime.date(2022, 3, 31), 8: datetime.date(2022, 3, 31)}, 'time': {0: datetime.time(8, 0), 1: datetime.time(8, 0), 2: datetime.time(20, 0), 3: datetime.time(20, 0), 4: datetime.time(8, 0), 5: datetime.time(8, 0), 6: datetime.time(12, 0), 7: datetime.time(20, 0), 8: datetime.time(20, 0)}, 'wind_lvl': {0: 8, 1: 5, 2: 10, 3: 7, 4: 8, 5: 10, 6: 5, 7: 2, 8: 1}, 'snow_lvl': {0: 10, 1: 11, 2: 7, 3: 9, 4: 4, 5: 0, 6: 2, 7: 4, 8: 4}, 'number_skiers': {0: 26, 1: 87, 2: 8, 3: 25, 4: 90, 5: 86, 6: 62, 7: 55, 8: 31}})
Now row 2 of avalanches after itertuples() is
Silvretta 2022-03-31 12:42:00 8 5.0 2.0 62.0
The answer from @Laurent leads to
Silvretta 2022-03-31 12:42:00 8 8.0 4.0 90.0
Given your edits, here is one way to do it:
avalanches = pd.concat([avalanches, status]).sort_values(
    by=["location", "date", "time"]
)
dfs = [
    avalanches.loc[avalanches["location"] == location, :]
    for location in avalanches["location"].unique()
]
dfs = [
    pd.concat(
        [
            df[["location", "date", "time", "mass"]],
            df[["wind_lvl", "snow_lvl", "number_skiers"]].shift(1),
        ],
        axis=1,
    )
    for df in dfs
]
avalanches = pd.concat(dfs).dropna(subset="mass").sort_values(by=["time"])
So that:
print(avalanches)
# Output
location date time mass wind_lvl snow_lvl number_skiers
0 Silvretta 2022-03-30 07:35:00 10.0 NaN NaN NaN
1 Landeck 2022-03-30 12:37:00 15.0 5.0 11.0 87.0
2 Silvretta 2022-03-31 12:42:00 8.0 5.0 2.0 62.0
3 Silvretta 2022-03-31 23:12:00 7.0 2.0 4.0 55.0
Thanks a lot to @Laurent, who showed me how to concat the dataframes into one and work on a joint dataframe. In the end I used the following:
avalanches = pd.concat([avalanches, status]).sort_values('time').reset_index(drop=True)
avalanches[["wind_lvl", "snow_lvl", "number_skiers"]] = avalanches.groupby(by=['location', 'date'])[["wind_lvl", "snow_lvl", "number_skiers"]].fillna(method='ffill')
avalanches.dropna(inplace=True)
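As an alternative sketch (not part of the original answer), pandas.merge_asof can express "take the latest strictly earlier status reading from the same location and day" directly, provided the times are first converted into something sortable. This assumes the original status and avalanches frames as defined above, before any extra columns were added:

import pandas as pd

# merge_asof needs a numeric or datetime-like ordering column; datetime.time is
# neither, so convert the times to seconds since midnight.
status_s = status.assign(
    secs=[t.hour * 3600 + t.minute * 60 + t.second for t in status['time']]
)
ava_s = avalanches.assign(
    secs=[t.hour * 3600 + t.minute * 60 + t.second for t in avalanches['time']]
)

merged = pd.merge_asof(
    ava_s.sort_values('secs'),
    status_s.sort_values('secs')[['location', 'date', 'secs',
                                  'wind_lvl', 'snow_lvl', 'number_skiers']],
    on='secs',
    by=['location', 'date'],       # exact match on city and day
    allow_exact_matches=False,     # only strictly earlier status readings
)

Rows without any earlier same-day reading keep NaN in the status columns, matching the behaviour of the itertuples() loop.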
I'm trying to write a piece of code and I keep getting stuck on searching a DataFrame for a list of columns, which is not allowed because a list is unhashable.
Essentially, I have a sequence: 'KGTLPK'
I want to first locate every instance of 'K' in the sequence: [0,5]
Then I want to search columns 0 and 5 of my DataFrame for a specific value: 80
I want to get a list of rows that have '80' in columns 0 & 5, and delete those rows.
query = 'KGTLPK'
AA = 'K'
x = []
for pos, char in enumerate(query):
    if char == AA:
        x.append(pos)
print(x)
#pddf = my DataFrame
df2 = pddf.filter(regex=x)
print(df2)
rows_removal = list(pddf.loc[pddf[df2] == Phosphorylation].index.tolist())
print(rows_removal)
pddf.drop(pddf.index[rows_removal])
My full DataFrame has 8855 rows, and this needs to decrease as improper values are identified. In this example I deleted all rows where column 0 was not equal to 16. I just need an easier way to do this so I don't hardcode everything.
pddf.head(10).to_dict() output, dataFrame unedited:
{0: {0: 16, 1: 16, 2: 16, 3: 16, 4: 16, 5: 16, 6: 16, 7: 16, 8: 16, 9: 16}, 1: {0: 16, 1: 16, 2: 16, 3: 16, 4: 16, 5: 16, 6: 16, 7: 16, 8: 16, 9: 16}, 2: {0: 16, 1: 16, 2: 16, 3: 16, 4: 16, 5: 16, 6: 16, 7: 16, 8: 16, 9: 16}, 3: {0: 16, 1: 44, 2: 80, 3: 42, 4: 71, 5: 28, 6: 14, 7: 28, 8: 42, 9: 81}}
This extends for 8000 rows.
This is as far as I've gotten. My goal for example would be to say something like "for every instance of '42' in column 3, delete that row"
Any help I can get is greatly appreciated!
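For reference, here is a minimal sketch of one way to do the whole thing without hardcoding column positions. It assumes pddf has integer column labels covering the positions found in the sequence (as in the pddf.head(10).to_dict() output above), and uses phospho_value = 80 as an illustrative name for the value whose rows should be removed:

query = 'KGTLPK'
AA = 'K'
phospho_value = 80  # illustrative placeholder for the value to filter on

# positions of the amino acid of interest in the sequence
cols = [pos for pos, char in enumerate(query) if char == AA]

# rows where any of those columns contain the value ...
mask = (pddf[cols] == phospho_value).any(axis=1)

# ... are dropped, keeping the rest
pddf = pddf[~mask]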
I would like to cut my SciPy dendrogram into a number of clusters at multiple threshold values.
I've tried using fcluster, but it can only cut at one threshold value.
(Here is a piece of code which I have taken from another question for example.)
import pandas
data = pandas.DataFrame({
'total_runs': {0: 2.489857755536053, 1: 1.2877651950650333, 2: 0.8898850111727028, 3: 0.77750321282732704, 4: 0.72593099987615461, 5: 0.70064977003207007, 6:0.68217502514600825,7: 0.67963194285399975, 8: 0.64238326692987524, 9:0.6102581538587678, 10: 0.52588765899448564, 11: 0.44813665774322564, 12: 0.30434031343774476, 13: 0.26151929543260161, 14: 0.18623657993534984, 15: 0.17494230269731209,16: 0.14023670906519603, 17: 0.096817318756050832, 18:0.085822227670014059, 19: 0.042178447746868117, 20: -0.073494398270518693,21: -0.13699665903273103, 22: -0.13733324345373216, 23: -0.31112299949731331, 24: -0.42369178918768974, 25: -0.54826542322710636,26: -0.56090603814914863, 27: -0.63252372328438811, 28: -0.68787316140457322,29: -1.1981351436422796, 30: -1.944118415387774,31: -2.1899746357945964, 32: -2.9077222144449961},
'total_salaries': {0: 3.5998991340231234,1: 1.6158435140488829, 2: 0.87501176080187315, 3: 0.57584734201367749, 4: 0.54559862861592978, 5: 0.85178295446270169,6: 0.18345463930386757, 7: 0.81380836410678736, 8: 0.43412670908952178, 9: 0.29560433676606418, 10: 1.0636736398252848, 11: 0.08930130612600648, 12: -0.20839133305170349, 13: 0.33676911316165403, 14: -0.12404710480916628, 15: 0.82454221267393346,16: -0.34510456295395986, 17: -0.17162157282367937, 18: -0.064803261585569982, 19: -0.22807757277294818, 20: -0.61709008778669083,21: -0.42506873158089231, 22: -0.42637946918743924, 23: -0.53516500398181921, 24: -0.68219830809296633, 25: -1.0051418692474947,26: -1.0900316082184143, 27: -0.82421065378673986, 28: 0.095758053930450004, 29: -0.91540963929213015, 30: -1.3296449323844519,31: -1.5512503530547552, 32: -1.6573856443389405}
})
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram
distanceMatrix = pdist(data)
dend = dendrogram(linkage(distanceMatrix, method='complete'),
                  color_threshold=4,
                  leaf_font_size=10,
                  labels=data.index.tolist())
[Dendrogram figure]
So for the above dendrogram I would like to make the cut at 3 for the green cluster, but for the blue and red clusters the cut should be made at 5 (so that both of them end up in a single cluster).
The method fcluster can do this with the monocrit parameter, which allows you to pinpoint exactly where to cut on the dendrogram. You want to make cuts at positions -1 and -3, where -1 is the top of the tree and -3 is the third node (where blue meets green) counting from the top down. This is how:
import numpy as np
from scipy.cluster.hierarchy import fcluster

Z = linkage(distanceMatrix, method='complete')
monocrit = np.zeros((Z.shape[0],))
monocrit[[-1, -3]] = 1
fc = fcluster(Z, 0, criterion='monocrit', monocrit=monocrit)
The flat clusters will be formed by performing separation only at nodes with values strictly greater than the threshold (which is 0).
To illustrate this, I first redo the dendrogram with numbered leaves:
dend = dendrogram(Z, color_threshold=4, leaf_font_size=10, labels = range(33))
and then print the flat clusters:
for k in range(1, 4):
    print(np.where(fc == k))
They are
(array([30, 31, 32]),)
(array([12, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]),)
(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15]),)
So, green is split in two and red and blue are together.
I have the following code and I don't understand what is happening behind the scenes. Can anyone please explain?
import sys
data={}
print sys.getsizeof(data)
######output is 280
data={ 1:2,2:1,3:2,4:5,5:5,6:6,7:7,8:8,9:9,0:0,11:11,12:12,13:13,14:14,15:15}
print sys.getsizeof(data)
######output is 1816
data={1:2,2:1,3:2,4:5,5:5,6:6,7:7,8:8,9:9,0:0,11:11,12:12,13:13,14:14,15:15,16:16}
print sys.getsizeof(data)
##### output is 1048
If we increase the length of the dictionary, its size in memory should increase as well, but instead it decreases. Why?
getsizeof() calls the object’s __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.
Windows x64 - if I do it like below:
data={ 1:2,2:1,3:2,4:5,5:5,6:6,7:7,8:8,9:9,0:0,11:11,12:12,13:13,14:14,15:15}
print sys.getsizeof(data)
print data
data[16]=16
print sys.getsizeof(data)
print data
printed:
1808
{0: 0, 1: 2, 2: 1, 3: 2, 4: 5, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15}
1808
{0: 0, 1: 2, 2: 1, 3: 2, 4: 5, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16}
But I did notice the same behaviour when re-creating the data dictionary as a literal, as you mentioned:
272 #empty data dict
1808 # 15 elements in data dict
1040 # 16 elements in data dict
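A minimal sketch (in Python 3 syntax) to observe this on your own interpreter - the exact byte counts depend on the CPython version and platform, so treat the printed numbers as illustrative only. The size reported by getsizeof jumps at the points where the dict's underlying hash table is resized, and a dict built as a literal can be pre-sized differently from one grown key by key, which is why a 16-key literal can end up smaller than a 15-key one:

import sys

d = {}
print(len(d), sys.getsizeof(d))
for i in range(20):
    d[i] = i
    # the size jumps wherever the underlying hash table is resized
    print(len(d), sys.getsizeof(d))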
I have the following dataset:
data = {'VALVE_SCORE': {0: 34.1,1: 41.0,2: 49.7,3: 53.8,4: 35.8,5: 49.2,6: 38.6,7: 51.2,8: 44.8,9: 51.5,10: 41.9,11: 46.0,12: 41.9,13: 51.4,14: 35.0,15: 49.7,16: 41.5,17: 51.5,18: 45.2,19: 53.4,20: 38.1,21: 50.2,22: 25.4,23: 30.0,24: 28.1,25: 49.9,26: 27.5,27: 37.2,28: 27.7,29: 45.7,30: 27.2,31: 30.0,32: 27.9,33: 34.3,34: 29.5,35: 34.5,36: 28.0,37: 33.6,38: 26.8,39: 31.8},
'DAY': {0: 6, 1: 6, 2: 6, 3: 6, 4: 13, 5: 13, 6: 13, 7: 13, 8: 20, 9: 20, 10: 20, 11: 20, 12: 27, 13: 27, 14: 27, 15: 27, 16: 3, 17: 3, 18: 3, 19: 3, 20: 10, 21: 10, 22: 10, 23: 10, 24: 17, 25: 17, 26: 17, 27: 17, 28: 24, 29: 24, 30: 24, 31: 24, 32: 3, 33: 3, 34: 3, 35: 3, 36: 10, 37: 10, 38: 10, 39: 10},
'MONTH': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 2, 17: 2, 18: 2, 19: 2, 20: 2, 21: 2, 22: 2, 23: 2, 24: 2, 25: 2, 26: 2, 27: 2, 28: 2, 29: 2, 30: 2, 31: 2, 32: 3, 33: 3, 34: 3, 35: 3, 36: 3, 37: 3, 38: 3, 39: 3}}
df = pd.DataFrame(data)
First, I would like to take the mean by day and then by month. However, taking the mean after grouping by day results in fractional month values. I would like to preserve the months before I do a groupby('MONTH').mean().
In [401]: df.groupby("DAY").mean()
Out[401]:
VALVE_SCORE MONTH
DAY
3 39.7250 2.5
6 44.6500 1.0
10 32.9875 2.5
13 43.7000 1.0
17 35.6750 2.0
20 46.0500 1.0
24 32.6500 2.0
27 44.5000 1.0
I would like the end result to be:
MONTH VALVE_SCORE
1 value
2 value
3 value
Consider that, with the data you have, you want the daily mean and then the monthly mean. Putting the same data into an Excel pivot table gives the same result as the pandas approach below; in pandas, grouping by month is enough:
df.groupby(['MONTH']).mean()
DAY VALVE_SCORE
MONTH
1 16.5 44.7250
2 13.5 38.0375
3 6.5 30.8000
Since the month and day values are numeric, pandas includes them in the mean; if the 'DAY' and 'MONTH' values were strings instead of numbers, you would get this result:
VALVE_SCORE
MONTH
1 44.7250
2 38.0375
3 30.8000
So pandas already gives you the monthly means directly; here they match the day-then-month means because every day has the same number of readings.
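If you want to make the two-step averaging explicit (daily means first, then monthly means of those), a chained groupby along these lines should also work - a sketch, assuming df is the DataFrame built above:

# Step 1: mean per (MONTH, DAY); Step 2: mean of those daily means per MONTH.
daily = df.groupby(['MONTH', 'DAY'], as_index=False)['VALVE_SCORE'].mean()
monthly = daily.groupby('MONTH', as_index=False)['VALVE_SCORE'].mean()
print(monthly)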
Here's a possible solution. Do let me know if there is a more efficient way of doing it.
df = pd.DataFrame(data)
months = list(df['MONTH'].unique())
frames = []
for p in months:
    df_part = df[df['MONTH'] == p]
    df_part_avg = df_part.groupby("DAY", as_index=False).mean()
    df_part_avg = df_part_avg.drop('DAY', axis=1)
    frames.append(df_part_avg)
df_months = pd.concat(frames)
df_final = df_months.groupby("MONTH", as_index=False).mean()
And the result is:
In [430]: df_final
Out[430]:
MONTH VALVE_SCORE
0 1 44.7250
1 2 38.0375
2 3 30.8000