I'm using nested loops to add new columns whose names are built dynamically from each dataset column (col) and every other column in the dataset (which I call the interact col). This works well for small datasets, but it becomes very slow on datasets with a very large number of features. Any tips for simplifying the process to make it faster?
import numpy as np
import pandas as pd
X = pd.read_csv('water_potability.csv')
X = X.drop(columns='Unnamed: 0')
X_columns = np.array(X.columns)
fi_df = X.copy()
done_list = []
for col in X_columns:
    interact_col = X.drop(columns = col).columns
    for int_col in interact_col:
        fi_df['({})_minus_({})'.format(col, int_col)] = X[col]-X[int_col]
        fi_df['({})_div_({})'.format(col, int_col)] = X[col]/X[int_col]
        if int_col not in done_list:
            fi_df['({})_add_({})'.format(col, int_col)] = X[col]+X[int_col]
            fi_df['({})_multi_({})'.format(col, int_col)] = X[col]*X[int_col]
    done_list.append(col)
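One possible way to speed this up, as a sketch rather than a definitive answer: inserting columns one at a time keeps growing the frame (pandas warns about fragmentation when this happens many times), so it may help to collect all new columns in a dict and concatenate once. The helper name pairwise_features is hypothetical, and the sketch assumes every column is numeric.

```python
from itertools import combinations, permutations

import numpy as np
import pandas as pd


def pairwise_features(X: pd.DataFrame) -> pd.DataFrame:
    """Return X plus pairwise minus/div/add/multi columns, built in one concat."""
    # Pull each column out once as a float array (assumes numeric columns).
    values = {col: X[col].to_numpy(dtype=float) for col in X.columns}
    new_cols = {}
    # Subtraction and division depend on order, so use ordered pairs.
    with np.errstate(divide="ignore", invalid="ignore"):
        for a, b in permutations(X.columns, 2):
            new_cols[f"({a})_minus_({b})"] = values[a] - values[b]
            new_cols[f"({a})_div_({b})"] = values[a] / values[b]
    # Addition and multiplication are symmetric, so each unordered pair
    # appears once, matching the done_list check in the loop version.
    for a, b in combinations(X.columns, 2):
        new_cols[f"({a})_add_({b})"] = values[a] + values[b]
        new_cols[f"({a})_multi_({b})"] = values[a] * values[b]
    # A single concat avoids repeated column insertion.
    return pd.concat([X, pd.DataFrame(new_cols, index=X.index)], axis=1)
```

The column names match the loop version, but the column order differs: all minus/div columns come first, then the add/multi columns.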
I'm making a reverse denoising autoencoder. My dataset is all lowercase, but I want the source entry to be capitalized in 80% of the rows and only 60% of the target entries to be capitalized. I wrote this:
import pandas as pd
import torch
df = pd.read_csv('Data/fb_moe.csv')
for i in range(len(df)):
    sample = int(torch.distributions.Bernoulli(torch.FloatTensor([.8])).sample())
    if sample == 1:
        df.iloc[i].y = str(df.iloc[i].y).capitalize()
        sample_1 = int(torch.distributions.Bernoulli(torch.FloatTensor([.6])).sample())
        if sample_1 == 1:
            df.iloc[i].x = str(df.iloc[i].x).capitalize()
df.to_csv('Data/fb_moe2.csv')
But this is pretty slow because my CSV has about 8 million rows. Is there a faster way to do this?
Part of the Dataframe
x,y
jon,jun
an,jun
ju,jun
jin,jun
nun,jun
un,jun
jon,jun
jin,jun
nen,jun
ju,jun
jn,jun
jul,jun
jen,jun
hun,jun
ju,jun
hun,jun
hun,jun
jon,jun
jin,jun
un,jun
eun,jun
jhn,jun
Try using boolean masks together with the vectorized .str methods; pandas is slow when you loop over rows in a Python for loop.
import numpy as np
n = len(df)
# True for roughly 80% of the rows
source = np.random.binomial(1, p=.8, size=n) == 1
target = source.copy()
total_source_true = np.sum(source)
# Among the source rows, keep roughly 60% as target rows
target[source] = np.random.binomial(1, p=.6, size=total_source_true) == 1
df.loc[source, 'x'] = df.loc[source, 'x'].str.capitalize()
df.loc[target, 'y'] = df.loc[target, 'y'].str.capitalize()
I am working on a data analysis in which I have to generate histograms. My code has more than seven nested for loops. Each loop filters the data frame by one unique value of a category column to form a new data frame of subcategories, which the next loop then splits further in the same way. Each day has around 400,000 records, and I have to process the records for the last 30 days. The goal is to produce histograms of the values (a single numerical column) for the last category, which cannot be split any further. How do I reduce the complexity? Are there any alternative methods?
for customer in data_frame['MasterCustomerID'].unique():
    df_customer = data_frame.loc[data_frame['MasterCustomerID'] == customer]
    for service in df_customer['Service'].unique():
        df_service = df_customer.loc[df_customer['Service'] == service]
        for source in df_service['Source'].unique():
            df_source = df_service.loc[df_service['Source'] == source]
            for subcomponent in df_source['SubComponentType'].unique():
                df_subcomponenttypes = df_source.loc[df_source['SubComponentType'] == subcomponent]
                for kpi in df_subcomponenttypes['KPI'].unique():
                    df_kpi = df_subcomponenttypes.loc[df_subcomponenttypes['KPI'] == kpi]
                    for device in df_kpi['Device_Type'].unique():
                        df_device_type = df_kpi.loc[df_kpi['Device_Type'] == device]
                        for access in df_device_type['Access_type'].unique():
                            df_access_type = df_device_type.loc[df_device_type['Access_type'] == access]
                            df_access_type['Day'] = ifweekday(df_access_type['PerformanceTimeStamp'])
You can use pandas.groupby to find the unique combinations of levels across the columns and then loop over the dataframe grouped by each combination. There are ~4000 combinations, so be careful when uncommenting the histogram code below.
import string
import numpy as np, pandas as pd
from matplotlib import pyplot as plt
np.random.seed(100)
# Generate 400,000 records (400 obs for 1000 individuals in 6 columns)
NIDS = 1000; NOBS = 400; NCOLS = 6
df = pd.DataFrame(np.random.randint(0, 4, size = (NIDS*NOBS, NCOLS)))
mapper = dict(zip(range(26), list(string.ascii_lowercase)))
df.replace(mapper, inplace = True)
cols = ['Service', 'Source', 'SubComponentType',
        'KPI', 'Device_Type', 'Access_type']
df.columns = cols
# Generate IDs for individuals
df['MasterCustomerID'] = np.repeat(range(NIDS), NOBS)
# Generate values of interest (to be plotted)
df['value2plot'] = np.random.rand(NIDS*NOBS)
# View the counts for each unique combination of column levels
df.groupby(cols).size()
# Do something with the different subsets (such as make histograms)
for levels, group in df.groupby(cols):
    print(levels)
    # fig, ax = plt.subplots()
    # ax.hist(group['value2plot'])
    # ax.set_title(", ".join(levels))
    # plt.savefig("hist_" + "_".join(levels) + ".png")
    # plt.close()
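If the plots themselves are not needed right away, the same groupby loop can collect the bin counts for every combination into one table. This is a minimal sketch, assuming at least two key columns and bin edges shared by all groups; group_histograms and its parameter names are hypothetical.

```python
import numpy as np
import pandas as pd


def group_histograms(df, keys, value_col, bin_edges):
    """Return one row of histogram counts per unique combination of keys."""
    edges = np.asarray(bin_edges, dtype=float)
    combos, counts = [], []
    # A single groupby pass replaces the nested filter loops.
    for combo, values in df.groupby(keys)[value_col]:
        combos.append(combo)
        counts.append(np.histogram(values.to_numpy(), bins=edges)[0])
    index = pd.MultiIndex.from_tuples(combos, names=keys)
    # Label each column with the edges of its bin.
    columns = [f"{lo:g}-{hi:g}" for lo, hi in zip(edges[:-1], edges[1:])]
    return pd.DataFrame(np.vstack(counts), index=index, columns=columns)
```

With the simulated data above, this could be called as group_histograms(df, cols, 'value2plot', np.linspace(0, 1, 11)), and any single row can be plotted later as a bar chart.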
I have a pandas dataframe which has a column 'INTENSITY' and a numpy array of same length containing the error for each intensity. I would like to generate columns with randomly generated numbers in the error range.
So far I use two nested for loops to create the new columns but I feel like this is inefficient:
theor_err = [ sqrt(abs(x)) for x in theor_df[str(INTENSITY)] ]
theor_err = np.asarray(theor_err)
for nr_sample in range(2):
    sample = np.zeros(len(theor_df[str(INTENSITY)]))
    for i, error in enumerate(theor_err):
        sample[i] = theor_df[str(INTENSITY)][i] + random.uniform(-error, error)
    theor_df['gen_{}'.format(nr_sample)] = Series(sample, index=theor_df.index)
theor_df.head()
Is there a more efficient way of approaching a problem like this?
Numpy can handle arrays for you. So, you can do it like this:
import pandas as pd
import numpy as np
a=pd.DataFrame([10,20,15,30],columns=['INTENSITY'])
a['theor_err']=np.sqrt(np.abs(a.INTENSITY))
# add the random offset to the intensity, as the loop in the question does
a['sample']=a['INTENSITY']+np.random.uniform(-a['theor_err'],a['theor_err'])
Suppose you want to generate 6 samples. You can try the code below, and tune the number of samples by setting the value of k.
df = pd.DataFrame([[1],[2],[3],[4],[-5]], columns=["intensity"])
k = 6
sample_names = ["sample" + str(i+1) for i in range(k)]
df["err"] = np.sqrt(np.abs((df["intensity"])))
df[sample_names] = pd.DataFrame(
    df["err"].map(lambda x: np.random.uniform(-x, x, k)).values.tolist())
df.loc[:,sample_names] = df.loc[:,sample_names].add(df.intensity, axis=0)
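The per-row lambda can also be avoided: NumPy's uniform draws accept array bounds and broadcast them against the requested size, so all k columns can be drawn in one call. A sketch, assuming NumPy 1.17 or later for default_rng; the helper name add_noisy_samples is hypothetical and the gen_ column prefix is borrowed from the question.

```python
import numpy as np
import pandas as pd


def add_noisy_samples(df, col, k, seed=None):
    """Append k columns holding df[col] plus uniform noise in [-err, err],
    where err = sqrt(|df[col]|) for each row."""
    rng = np.random.default_rng(seed)
    values = df[col].to_numpy(dtype=float)
    err = np.sqrt(np.abs(values))
    # Bounds of shape (n, 1) broadcast against size (n, k): one draw per cell.
    noise = rng.uniform(-err[:, None], err[:, None], size=(len(df), k))
    samples = pd.DataFrame(
        values[:, None] + noise,
        index=df.index,
        columns=[f"gen_{i}" for i in range(k)],
    )
    return pd.concat([df, samples], axis=1)
```

For example, add_noisy_samples(theor_df, 'INTENSITY', k=2) would produce gen_0 and gen_1 without any Python-level loop over rows, assuming the column is really named 'INTENSITY'.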
I want to make the for loop below faster in Python.
import math
import pandas as pd
import numpy as np
import scipy.stats
np.random.seed(1)
xl = pd.DataFrame({'Concat' : np.arange(101,999), 'ships_x' : np.random.randint(1001,3000,size=898)})
yl = pd.DataFrame({'PickDate' : np.random.randint(1,8,size=10000),'Concat' : np.random.randint(101,999,size=10000), 'ships_x' : np.random.randint(101,300,size=10000), 'ships_y' : np.random.randint(1001,3000,size=10000)})
tempno = [np.random.randint(1,100,size=5)]
k=1
p = pd.DataFrame(0,index=np.arange(len(xl)),columns=['temp','cv']).astype(object)
for ib in range(len(xl)):
    tempno1 = np.append(tempno,ib)
    temp = list(set(tempno1))
    # select both columns with a list; a bare tuple fails on current pandas
    temptab = yl[yl['Concat'].isin(np.array(xl['Concat'][tempno1]))].groupby('PickDate')[['ships_x','ships_y']].sum().reset_index()
    temptab['contri'] = temptab['ships_x']/temptab['ships_y']
    cv = scipy.stats.variation(temptab['contri'])
    # .at replaces the removed .ix indexer for setting a single cell
    p.at[k-1,'cv'] = 1 if math.isnan(cv) else cv
    p.at[k-1,'temp'] = temp
    k = k+1
where:
xl, yl - the two data frames I am working on, with columns such as Concat, ships_x and ships_y.
tempno - an initial list of indices into the xl dataframe, referring to a list of 'Concat' values.
So, in each iteration of the for loop we add one extra index to tempno and then subset the 'yl' dataframe to the rows whose 'Concat' values match those selected from the 'xl' dataframe. Then we compute the coefficient of variation (taken from the scipy library) and record it in a new dataframe 'p'.
The problem is that this takes too much time, because the number of loop iterations runs into the thousands. The groupby line takes the most time. I have tried a few changes, and the code now looks like the version below, with the changes described in the comments. There is a slight improvement, but it doesn't solve my problem. Please suggest the fastest possible way to implement this. Many thanks.
# Getting all tempno1 into a list with one step
tempno1 = [np.append(tempno,ib) for ib in [xb for xb in range(0,len(xl))]]
temp = [list(set(tempk)) for tempk in tempno1]
# Taking only needed columns from x and y dfs
xtemp = xl[['Concat']]
ytemp = yl[['Concat','ships_x','ships_y','PickDate']]
#Shortlisting y df and groupby in two diff steps
ytemp = [ytemp[ytemp['Concat'].isin(np.array(xtemp['Concat'][tempnokk]))] for tempnokk in tempno1]
temptab = [ytempk.groupby('PickDate')[['ships_x','ships_y']].sum().reset_index() for ytempk in ytemp]
tempkcontri = [tempk['ships_x']/tempk['ships_y'] for tempk in temptab]
tempkcontri = [pd.DataFrame(tempkcontri[i],columns=['contri']) for i in range(0,len(tempkcontri))]
temptab = [temptab[i].join(tempkcontri[i]) for i in range(0,len(temptab))]
pcv = [1 if math.isnan(scipy.stats.variation(temptabkk['contri'])) else scipy.stats.variation(temptabkk['contri']) for temptabkk in temptab]
p = pd.DataFrame({'temp' : temp,'cv': pcv})
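One idea that might help, offered as a sketch under assumptions: every iteration uses the same starting indices plus one extra index, so the per-PickDate sums for iteration ib are the sums for the starting set plus the sums for row ib. The groupby can then run once over yl instead of once per iteration. The function name incremental_cv is hypothetical. The sketch assumes xl has the default 0..n-1 index, that 'Concat' is unique in xl, and that ships_y is strictly positive. It uses the population standard deviation, which is what scipy.stats.variation uses by default.

```python
import numpy as np
import pandas as pd


def incremental_cv(xl, yl, base_idx):
    """Coefficient of variation of ships_x/ships_y across PickDates, for the
    base rows of xl plus each single extra row of xl in turn."""
    # Aggregate yl once: one row per Concat, one column per PickDate.
    sums = yl.groupby(["Concat", "PickDate"])[["ships_x", "ships_y"]].sum()
    concat = xl["Concat"].to_numpy()
    sx = sums["ships_x"].unstack(fill_value=0).reindex(concat, fill_value=0)
    sy = sums["ships_y"].unstack(fill_value=0).reindex(concat, fill_value=0)
    sx = sx.to_numpy(dtype=float)
    sy = sy.to_numpy(dtype=float)

    base = np.unique(base_idx)
    in_base = np.zeros(len(xl), dtype=bool)
    in_base[base] = True
    # Row ib contributes only if it is not already part of the base set.
    extra = ~in_base[:, None]
    tot_x = sx[base].sum(axis=0) + np.where(extra, sx, 0.0)
    tot_y = sy[base].sum(axis=0) + np.where(extra, sy, 0.0)

    # A PickDate is present in a subset only if some ships_y was summed.
    present = tot_y > 0
    ratio = np.where(present, tot_x / np.where(present, tot_y, 1.0), 0.0)
    n_dates = present.sum(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        mean = ratio.sum(axis=1) / n_dates
        dev = np.where(present, (ratio - mean[:, None]) ** 2, 0.0)
        cv = np.sqrt(dev.sum(axis=1) / n_dates) / mean
    # Mirror the original code: NaN results are replaced by 1.
    return np.where(np.isnan(cv), 1.0, cv)
```

With the data above, incremental_cv(xl, yl, tempno[0]) would return one value per row of xl, in the same order as the loop fills the 'cv' column.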
I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and Python eventually crashes.
How can I do a pivot on data this large with a limited amount of RAM?
EDIT: adding sample code
The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3, instead of 4, the code will produce a false positive for correct output. The main issue is that if a shipmentid entry is not in each and every chunk that sum(wawa) looks at, it doesn't show up in the output.
import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os
pd.set_option('io.hdf.default_format','table')
# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid']=[1,2,3,1,2,3,1,2,3] #evenly distributing shipmentid values for testing purposes
    frame['qty']= np.random.randint(1,5,9) #random quantity is ok for this test
    frame['catid'] = np.random.randint(1,5,9) #random category is ok for this test
    return frame
def pivotSegment(segmentNumber,passedFrame):
    segmentSize = 3 #take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)] #slice the input DF
    # ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values
    span = pd.DataFrame()
    span['catid'] = range(1,5+1)
    span['shipmentid']=1
    span['qty']=0
    frame = frame.append(span)
    return frame.pivot_table(['qty'],index=['shipmentid'],columns='catid', \
                             aggfunc='sum',fill_value=0).reset_index()
def createStore():
    store = pd.HDFStore('testdata.h5')
    return store
segMin = 0
segMax = 4
store = createStore()
frame = loadFrame()
print('Printing Frame')
print(frame)
print(frame.info())
for i in range(segMin,segMax):
    segment = pivotSegment(i,frame)
    store.append('data',frame[(i*3):(i*3 + 3)])
    store.append('pivotedData',segment)
print('\nPrinting Store')
print(store)
print('\nPrinting Store: data')
print(store['data'])
print('\nPrinting Store: pivotedData')
print(store['pivotedData'])
print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('**************')
print('$$$')
for df in store.select('pivotedData',chunksize=3):
    print(df.set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('$$$')
store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid',level=0).sum() for df in store.select('pivotedData',chunksize=3)))
print('\nPrinting Store: pivotedAndSummed')
print(store['pivotedAndSummed'])
store.close()
os.remove('testdata.h5')
print('closed')
You could do the appending with HDF5/pytables. This keeps it out of RAM.
Use the table format:
store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):
df = store['df']
You can also query, to get only subsections of the DataFrame.
Aside: You should also buy more RAM, it's cheap.
Edit: you can groupby/sum from the store iteratively since this "map-reduces" over the chunks:
# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()
Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2); instead, you can use reduce with add:
reduce(lambda x, y: x.add(y, fill_value=0),
       (df.groupby().sum() for df in store.select('df', chunksize=50000)))
In python 3 you must import reduce from functools.
Perhaps it's more pythonic/readable to write this as:
chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks) # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)
If performance is poor, or if there are a large number of new groups, then it may be preferable to initialize res as a zero-filled frame of the correct size (by getting the unique group keys, e.g. by looping through the chunks once), and then add in place.
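To make the reduce-with-add idea concrete, here is a minimal sketch that works on any iterable of DataFrame chunks, such as the one returned by store.select('df', chunksize=50000). The helper name chunked_group_sum is hypothetical. The point it illustrates is that add with fill_value=0 keeps group keys that are missing from some chunks, whereas plain + would turn them into NaN.

```python
from functools import reduce

import pandas as pd


def chunked_group_sum(chunks, key):
    """Sum grouped values across an iterable of DataFrame chunks without
    loading all chunks into memory at once."""
    # Each chunk is reduced to its own small per-key summary first.
    partials = (chunk.groupby(key).sum() for chunk in chunks)
    # fill_value=0 treats a key absent from one side as contributing zero.
    return reduce(lambda acc, part: acc.add(part, fill_value=0), partials)
```

Note that reduce raises a TypeError when there are no chunks at all, and that aligning on missing keys can turn integer columns into floats.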