I have a large dataset (6M rows). For a given column, timestamp, I want to take the first 11 characters of each element and construct a new column. So far I am doing it with the apply method, but it takes a long time.
df_value_dl['time_sec'] = df_value_dl.apply(lambda x: str(x['timestamp'])[0:10], axis=1)
While looking for faster methods I came across NumPy arrays.
What would be the correct syntax to do this using NumPy arrays? Thanks.
Just in case you haven't found a solution yet: this
df_value_dl['time_sec'] = df_value_dl['timestamp'].astype('string').str[:10]
should be faster than apply.
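For reference, a minimal self-contained sketch of the same idea on toy data (the column names mirror the question; actual timings will of course depend on your data and machine):

import pandas as pd

# Toy stand-in for the 6M-row frame in the question.
df_value_dl = pd.DataFrame({'timestamp': ['2021-03-01 12:34:56', '2021-03-02 01:02:03']})

# Cast the whole column once and slice it in a single vectorized pass,
# instead of calling a Python lambda for every row via apply.
df_value_dl['time_sec'] = df_value_dl['timestamp'].astype('string').str[:10]
print(df_value_dl)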
I currently have a pretty large 3D numpy array (atlasarray, 14M elements of type int64) from which I want to create a duplicate array where every element is a float taken from a separate dataframe lookup (organfile).
I'm very much a beginner, so I'm sure there must be a better (quicker) way to do this. Currently it takes around 90 s, which isn't ages but can surely be reduced. Most of the code below was pieced together from hours of Googling, so it surely isn't optimised.
import numpy as np
import pandas as pd
from tqdm import tqdm

organfile = pd.read_excel('/media/sf_VMachine_Shared_Path/ValidationData/ICRP110/AF/AF_OrgansSimp.xlsx')

densityarray = atlasarray
densityarray = densityarray.astype(float)

# Iterate over every element and overwrite it with the density
# looked up in the dataframe for that organ ID.
for idx, x in tqdm(np.ndenumerate(densityarray), total=densityarray.size):
    densityarray[idx] = organfile.loc[x, 'Density']
All of the elements in the original numpy array are integers which correspond to an organ ID. I used pandas to read the key in from an Excel file and generate a 4-column dataframe, from which in this particular case I want to extract the 4th column (a float). OrganIDs go up to 142.
| OrganID | OrganName | TissueType | Density |
|---------|-----------|------------|---------|
| 0       | Air       | 53         | 0.001   |
| 1       | Adrenal   | 43         | 1.030   |
Any recommendations on ways I can speed this up would be gratefully received.
Put the density from the dataframe into a numpy array:
density = np.array(organfile['Density'])
Then run:
density[atlasarray]
Don't use loops; they are slow. The following example with 14M elements takes less than a second to run:
density = np.random.random(143)
atlasarray = np.random.randint(0, 142, (1000, 1000, 14))
densityarray = density[atlasarray]
Shape of densityarray:
print(densityarray.shape)
(1000, 1000, 14)
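To wire this up to the question's organfile, one possible sketch (the column names 'OrganID' and 'Density' are taken from the question's table, and it assumes the OrganIDs are contiguous integers starting at 0, i.e. 0..142):

# Put the densities in OrganID order so that position i holds the density of organ i.
density = organfile.set_index('OrganID')['Density'].sort_index().to_numpy()

# Fancy indexing maps every organ ID in the atlas to its density in one shot.
densityarray = density[atlasarray]

If the IDs were not contiguous, you would first need to reindex the lookup so that every ID appearing in atlasarray has a slot.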
I have a csv dataset with texts. I need to search through them. I couldn't find an easy way to search for a string in a dataset and get the row and column indexes. For example, let's say the dataset is like:
df = pd.DataFrame({"China": ['Xi','Lee','Hung'], "India": ['Roy','Rani','Jay'], "England": ['Tom','Sam','Jack']})
Now let's say I want to find the string 'rani' and know its location. Is there a simple function to do that? Or do I have to loop through everything to find it?
One vectorized (and therefore relatively scalable) solution to this is to leverage numpy.where:
import numpy as np
np.where(df == 'Rani')
This returns two arrays, corresponding to row and column indices:
(array([1]), array([1]))
You can continue to take advantage of vectorized operations, but also write a more complicated filtering function, like so:
np.where(df.applymap(lambda x: "ani" in x))
In other words, "apply to each cell the function that returns True if 'ani' is in the cell", and then conduct the same np.where filtering step.
You can use any function:
def _should_include_cell(cell_contents):
    return cell_contents.lower() == "rani" or "Xi" in cell_contents

np.where(df.applymap(_should_include_cell))
Some final notes:
- applymap is slower than simple equality checking
- if you need this to scale WAY up, consider using dask instead of pandas
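If you also want the row and column labels rather than positional indices, a small follow-on sketch building on the same np.where output:

rows, cols = np.where(df == 'Rani')
# Translate the positional indices into the DataFrame's own labels.
hits = list(zip(df.index[rows], df.columns[cols]))
print(hits)  # [(1, 'India')] for the example frame above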
Not sure how this will scale, but it works:
df[df.eq('Rani')].dropna(axis=1, how='all').dropna()
India
1 Rani
I wrote a function in which each cell of a DataFrame is divided by a number saved in another dataframe.
def calculate_dfA(df_t, xout):
    df_A = df_t.copy()
    vector_x = xout.T
    for index_col, column in tqdm(df_A.iteritems()):
        for index_row, row in df_A.iterrows():
            df_A.iloc[index_row, index_col] = df_A.iloc[index_row, index_col] / vector_x.iloc[0, index_col]
    return df_A
The DataFrame on which I apply the calculation has a size of 14839 rows x 14839 columns. According to tqdm, the processing speed is roughly 4.5 s/it. Accordingly, the calculation would require approximately 50 days, which is not feasible for me. Is there a way to speed up my calculation?
You need to vectorize your division:
result = df_A.values / vector_x
This will broadcast along the row dimension and divide along the column dimension, as you seem to ask for.
Compared to your double for-loop, you are taking advantage of contiguity and homogeneity of the data in memory. This allows for a massive speedup.
Edit: Coming back to this answer today, I noticed that converting to a NumPy array first speeds up the computation. Locally I get a 10x speedup for an array of a size similar to the one in the question above. I have edited my answer accordingly.
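As a minimal, self-contained illustration of the broadcasting (with toy shapes standing in for the 14839 x 14839 DataFrame and the one-row vector of divisors from the question):

import numpy as np
import pandas as pd

# Toy stand-ins for df_t and vector_x = xout.T from the question.
df_t = pd.DataFrame(np.random.rand(5, 3))
vector_x = pd.DataFrame(np.random.rand(1, 3))  # one divisor per column

# Convert to NumPy once; broadcasting then divides every column
# by its matching divisor in a single vectorized operation.
result = df_t.to_numpy() / vector_x.to_numpy()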
I'm on mobile right now, but you should try to avoid every for loop in Python; there's always a better way.
For one, I know you can multiply a pandas column (Series) by another column to get your desired result.
I think that to multiply every column by the matching column of another DataFrame you would still need to iterate, but only with one for loop, which is already a performance boost.
I would strongly recommend that you temporarily convert to a numpy ndarray and work with that.
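If you would rather stay in pandas, a hedged alternative (not what the answers above describe, just another option) is DataFrame.div with an explicit axis; this assumes xout is a single-column DataFrame, as the question's vector_x = xout.T suggests:

# Divide each column of df_t by the corresponding entry of xout,
# matching positionally by passing a plain NumPy array.
df_A = df_t.div(xout.to_numpy().ravel(), axis='columns')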
I'm using pandas to do some conditional filtering based on string matching using the fuzzywuzzy module. I've written some code that works, but is painfully slow and goes against every instinct in my body because I'm using a for loop over a pandas Series.
My issue is that I want to compare array of strings to another, and if a string in one array is similar enough to ANY string in the other array, I want to remove it from the array completely. My current code is this:
from fuzzywuzzy import fuzz
import pandas as pd
for value in new_contacts['StringMatch']:  # this is a pandas column in a dataframe
    previous_contacts['ratio'] = previous_contacts['StringMatch'].apply(lambda x: fuzz.ratio(x, value))
    previous_contacts = previous_contacts[previous_contacts['ratio'] > 97]  # fuzz.ratio outputs an int between 0 and 100
    previous_contacts.drop('ratio', axis=1, inplace=True)
Does anyone have any suggestions / best practices to make this code faster?
There might be a faster way to do what you are asking. If possible, I'd ask you to reevaluate your need for the fuzzywuzzy package. The edit distance computation is very expensive, as it constructs a matrix of size n * m (n and m being the lengths of the two strings) for each pair of strings in your arrays.
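If you do need to keep fuzzywuzzy, one possible tidy-up (still O(n*m) ratio calls, so not asymptotically cheaper) is its process.extractOne helper with a score cutoff, instead of building and filtering a temporary column per value. This sketch keeps only the new contacts that do not closely match any previous contact (flip the direction if that is not the intent) and assumes both frames have the 'StringMatch' column from the question:

from fuzzywuzzy import fuzz, process

previous = previous_contacts['StringMatch'].tolist()

def has_close_match(value, choices, cutoff=97):
    # extractOne returns None when no choice reaches the cutoff.
    return process.extractOne(value, choices, scorer=fuzz.ratio, score_cutoff=cutoff) is not None

keep = new_contacts['StringMatch'].apply(lambda v: not has_close_match(v, previous))
new_contacts = new_contacts[keep]

Installing python-Levenshtein alongside fuzzywuzzy also speeds up the ratio computation itself.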
There is an SFrame with columns whose elements are dicts.
import graphlab
import numpy as np
a = graphlab.SFrame({'col1': [{'oshan': 3, 'modi': 4}, {'ravi': 1, 'kishan': 5}],
                     'col2': [{'oshan': 1, 'rawat': 2}, {'hari': 3, 'kishan': 4}]})
I want to calculate cosine distance between these two columns for each row of the SFrame. Below is the operation using for loop.
dis = np.zeros(len(a), dtype=float)
for i in range(len(a)):
    dis[i] = graphlab.distances.cosine(a['col1'][i], a['col2'][i])
a['distance12'] = dis
This is very inefficient and would take hours if the number of rows were large. Could someone please suggest a better approach?
You can usually avoid looping over an SFrame by using the apply function. In your case, it would look like this:
a.apply(lambda row: graphlab.distances.cosine(row['col1'], row['col2']))
That should be significantly faster than looping in Python.
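To store the result back in the SFrame, mirroring the a['distance12'] = dis line from the question, the returned SArray can be assigned directly:

# apply returns an SArray with one distance per row.
a['distance12'] = a.apply(lambda row: graphlab.distances.cosine(row['col1'], row['col2']))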