I am trying to use the function getKmers from this post:
def getKmers(sequence, size=6):
    return [sequence[x:x+size].lower() for x in range(len(sequence) - size + 1)]
human_data['words'] = human_data.apply(lambda x: getKmers(x['sequence']), axis=1)
However, my data set has over 50,000 observations, and when I ran this script my notebook crashed every time. How should I optimize it?
One solution I can think of is to divide the data set into pieces and run this code iteratively, but I doubt that will help, since apply() really just iterates through each row anyway. I am not sure what the problem is here.
Try doing it like this. You're unnecessarily going over each column by using apply that way, when you just want to use the sequence column.
human_data['words'] = human_data['sequence'].apply(getKmers)
Edit: While this is faster (you're forgoing running the lambda function), your original way was not going through each column; I mixed up apply with applymap.
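As an aside, if it is memory rather than speed that crashes the notebook, the chunking idea from the question could be sketched roughly like this (the chunk count of 10 is arbitrary, and this only bounds the temporary memory of each apply pass):
import numpy as np
import pandas as pd

# sketch: process the frame in pieces so each apply pass stays small
parts = []
for chunk in np.array_split(human_data, 10):
    parts.append(chunk['sequence'].apply(getKmers))
human_data['words'] = pd.concat(parts)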
I have data in a tabular format with an id per row. One of the columns is a flag holding one or more categorical values, e.g. condition_one, condition_two.
I'm generating summary statistics using the below:
function_count_certain_condition = lambda x: x.str.count("condition_two").sum()
function_count_certain_condition.__name__ = 'number_of_two_conditions'
# ---
aggregations = {
    'column_one': ['count', 'first', 'last', 'nunique'],
    'conditions_column': [function_count_certain_condition]
}
df_aggregate_stats = df.groupby(['id_column']).agg(aggregations)
This works, but it doesn't seem particularly pythonic or performant. I tried using value_counts() but got a KeyError.
particularly pythonic
Indeed: you are using a lambda, storing it in a variable (the whole point of a lambda is lost if it is not nameless), and then shoving a name onto it. Just use def to define a function; that is, replace
function_count_certain_condition = lambda x: x.str.count("condition_two").sum()
function_count_certain_condition.__name__ = 'number_of_two_conditions'
with
def number_of_two_conditions(x):
    return x.str.count("condition_two").sum()
performant
First, be warned against premature optimization: if the code works fast enough for your use case, do not try to force it to be faster. Regarding that particular function, I do not see anything that would cause excessive execution time, as both substring counting and summation are generally fast operations.
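For completeness, the named function slots straight into the original aggregation dict, and its name shows up in the output header without the __name__ hack:
def number_of_two_conditions(x):
    return x.str.count("condition_two").sum()

aggregations = {
    'column_one': ['count', 'first', 'last', 'nunique'],
    'conditions_column': [number_of_two_conditions]  # name appears in the result
}
df_aggregate_stats = df.groupby(['id_column']).agg(aggregations)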
I have 2 datasets from different sources and I'd like to filter out redundancy. I have a function called "compare_df" that takes one row from each df and compares them; when they match it returns True, else False.
But "compare_df" is more complex than that: the information is not formatted the same way, and I check for time window overlaps, so simply checking whether elements match is not possible.
Also, there are no matching columns between the two dfs.
My current solution is using apply twice like in this code:
result_df = first_df[first_df.apply(
    lambda x: ~second_df.apply(compare_df, axis=1, args=[x, no_end_time]).any(),
    axis=1)]
Is there an easy way to optimize the code so that it runs faster?
Maybe it is possible to just "break" out of the second apply as soon as a True value is returned (a rough sketch of that idea follows below).
Using iterrows should be slower, by the way, because there shouldn't be that much redundancy, so the benefit of easily breaking out of the loop probably won't outweigh pandas' faster apply.
(I know numba could help, but it seems too complicated for this simple task)
Thanks for suggestions and other hints!
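A rough sketch of the short-circuit idea above, assuming compare_df(row, x, no_end_time) matches the signature implied by the apply call; any() over a generator stops at the first match, though whether this beats the double apply depends on how common matches are:
mask = first_df.apply(
    lambda x: not any(
        compare_df(row, x, no_end_time)     # same call order as the apply version
        for _, row in second_df.iterrows()  # stops at the first True
    ),
    axis=1,
)
result_df = first_df[mask]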
I'm trying to apply multiple functions, which come from other internal scripts, to my pandas dataframes. However, the code gets really long when applying them, and I think that isn't good engineering practice in general:
df['data_udo_product_brand_mode'] = df['data_udo_product_brand'].fillna(
'[]').apply(think_preprocessing.create_list).apply(think_math.get_mode).apply(lambda x: x.strip('"').lower())
Is there a more efficient and cleaner way to apply multiple functions to a dataframe or a dataframe column than the above?
Thanks in advance!
There may well be a better way, but one thing you can do here is separate out the lambda function so the line stops getting so long. For example:
# Lowercase and strip the surrounding double quotes
strip_lower = lambda x: x.strip('"').lower()
df['data_udo_product_brand_mode'] = (
    df['data_udo_product_brand']
    .fillna('[]')
    .apply(think_preprocessing.create_list)
    .apply(think_math.get_mode)
    .apply(strip_lower)
)
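Alternatively, the three steps can be folded into one named function so the column is traversed in a single pass; a sketch, assuming create_list and get_mode keep the signatures implied above:
def brand_mode(raw):
    # compose the three steps of the original chain in one call
    mode = think_math.get_mode(think_preprocessing.create_list(raw))
    return mode.strip('"').lower()

df['data_udo_product_brand_mode'] = df['data_udo_product_brand'].fillna('[]').apply(brand_mode)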
Say I have some dask dataframe. I'd like to do some operations with it, then save it to csv and print its len.
As I understand it, the following code will make dask compute df twice; am I right?
df = dd.read_csv('path/to/file', dtype=some_dtypes)
#some operations...
df.to_csv("path/to/out/*")
print(len(df))
Is it possible to avoid computing it twice?
upd. This is what happens when I use the solution by @mdurant, but the actual number of rows is almost 6 times smaller than the value it prints.
Yes, you can achieve this. Pass the optional keyword compute=False to to_csv to get a lazy version of the write-to-disk process, and use df.size, which is like len() but is also lazily computed.
import dask

# build the write task lazily, then compute it together with the size
futs = df.to_csv("path/to/out/*", compute=False)
_, l = dask.compute(futs, df.size)
This lets dask notice the work shared between writing and measuring the length, so it does not have to read the data twice.
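A note on the update above: df.size counts every cell (rows × columns), not rows, which would explain a roughly 6x discrepancy on a six-column frame. A sketch using the lazily computed row count from df.shape[0] instead (assuming a dask version where the row component of shape is lazy):
import dask

futs = df.to_csv("path/to/out/*", compute=False)
# df.shape[0] is a lazy scalar holding the true row count
_, n_rows = dask.compute(futs, df.shape[0])
print(n_rows)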
I have a Python package installed that returns Google geography info for a search term. I cannot work out how to apply the function across all the rows in my data. I can confirm that it works for a single result:
But how can I apply this across every row in the 7th column?
I understand I need a for loop:
But how do I apply my function from the package to each row?
Thanks a lot
You can use the apply function, which is faster:
data2['geocode_result'] = data2['ghana_city'].apply(lambda x: gmaps.geocode(x))
The result will be JSON strings in the geocode_result column, so you may have to define a custom function to extract the information you want from them. At least apply is a step in the right direction.
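For instance, a rough sketch of such an extractor, assuming the googlemaps client's usual list-of-dicts response shape (the names here are illustrative):
def extract_latlng(results):
    # results: whatever gmaps.geocode() stored; assumed to be a
    # (possibly empty) list of result dicts
    if not results:
        return None
    location = results[0]['geometry']['location']
    return location['lat'], location['lng']

data2['latlng'] = data2['geocode_result'].apply(extract_latlng)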
pd.Series.map is one way to streamline your algorithm:
data2['geocode_result'] = data2['ghana_city'].map(gmaps.geocode)
This is still not vectorised, as gmaps.geocode() will be applied to every element in the series, but, similar to @ScratchNPurr's approach, you should see an improvement.