Can I iterate through my df within a defined function? - python

I have a function that takes two numbers and makes an output. I want to apply that function to a specific column in my df for the first two rows, and then the next two, etc.
function(df['foo'][0:2]) works, and the next group to be passed into the function would be [2:4]. How would I create a loop that would iterate through my entire df? Or am I not thinking about this correctly?

Use a for loop that iterates in steps of 2.
for i in range(0, len(df.index), 2):
    function(df['foo'][i:i+2])
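As a rough, self-contained sketch of that loop: the sample data and the body of function below are made up for illustration (the question doesn't show them), and .iloc is used so the slice is always positional.

```python
import pandas as pd

# Hypothetical sample data; the real df comes from the question.
df = pd.DataFrame({"foo": [1, 2, 3, 4, 5, 6]})

def function(pair):
    # Hypothetical stand-in for the asker's function: add the two values.
    return int(pair.sum())

results = []
for i in range(0, len(df.index), 2):
    # .iloc slices by position, so this works whatever the index labels are.
    results.append(function(df["foo"].iloc[i:i + 2]))

print(results)
```

If the number of rows is odd, the last slice contains a single value, so the function has to cope with that case.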

Related

How to vectorize this function to iterate over a particular column in dataframe?

i = 0
delta = 0.5
for choice in df["Num"].values:
    z = [x for x in a["num2"] if choice-delta <= x <= choice+delta]  # select the subset of the random numbers that lies within the range
    df["New Num"].iloc[i] = random.choice(z)  # pick a random number from the subset to update the column
    i = i + 1
Here df is a dataframe with a column named "Num" that I want to iterate over. For each element in Num, I want to pick a random element from a dataframe named a that lies within a specified bound of that element, and use it to update df.
Note: My dataset contains 150k values
df["Num"]-Which needs to be updated
a["Num2"]-The random sample from which I have to update
I'm not sure what you are expecting.
But since you are taking random items from a["num2"], you can first use a.sample(frac=1) to shuffle the rows before entering the loop.
Then, to change the values in df, I would recommend defining a function update() and calling df["Num"].apply(update) to get your result.
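A minimal sketch of what such an update() could look like. The sample data, the fallback for an empty candidate set, and the use of Series.between are my assumptions, not something the question specifies.

```python
import random

import pandas as pd

# Hypothetical sample data standing in for the real df and a.
df = pd.DataFrame({"Num": [1.0, 5.0, 9.0]})
a = pd.DataFrame({"num2": [0.8, 1.2, 4.9, 5.3, 9.1]})
delta = 0.5

def update(choice):
    # Keep only the values of a["num2"] within delta of choice (bounds inclusive).
    mask = a["num2"].between(choice - delta, choice + delta)
    candidates = a.loc[mask, "num2"]
    if candidates.empty:
        # Assumption: leave the value unchanged when nothing is in range.
        return choice
    return random.choice(candidates.tolist())

df["New Num"] = df["Num"].apply(update)
print(df)
```

This is still one Python-level call per row, so on 150k values it is tidier than the original loop but not truly vectorised.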

How to find a specific string inside a bunch of strings? Both inside different cells

How can I find if one of these words from a column A is inside of one of these columns B and C?
column A) df['all_types'] = 'spray, protetor, toalha, esfoliante'
column B) df['make_sempre'] = 'limpador-facial,esfoliante,hidratante-labial,hidratante,serum'
column C) df['skin_sempre'] = 'corretivo,batom,produtos-sombrancelha,pinceis,mascara-cilios,iluminador,gloss,blush,delineador'
I've done it using a loop inside a loop inside a loop. And it worked.
But with hundreds of thousands of rows, this was impossible.
I've split these words into separated columns, and then applied a loop to compare each column with the others.
I'm using python and pandas
You might want to try set operations: turn each column's value into a set of words (or combine columns B and C into one set), then take the difference.
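Here is a small sketch of the set idea for a single row, using the three strings from the question. The helper to_set is a made-up name, and showing the intersection next to the difference is my addition: the difference gives the words of A that are missing from B and C, while the intersection gives the ones that are present.

```python
all_types = 'spray, protetor, toalha, esfoliante'
make_sempre = 'limpador-facial,esfoliante,hidratante-labial,hidratante,serum'
skin_sempre = ('corretivo,batom,produtos-sombrancelha,pinceis,'
               'mascara-cilios,iluminador,gloss,blush,delineador')

def to_set(cell):
    # Hypothetical helper: split on commas and strip surrounding spaces.
    return {word.strip() for word in cell.split(',') if word.strip()}

a_words = to_set(all_types)
bc_words = to_set(make_sempre) | to_set(skin_sempre)  # B and C combined

missing = a_words - bc_words  # words of A found in neither B nor C
found = a_words & bc_words    # words of A found in B or C

print(found)
print(missing)
```

With pandas, the same function could be applied row by row, which replaces the innermost loops with set lookups.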

How to loop a command in python with a list as variable input?

This is my first post to the coding community, so I hope I get the right level of detail in my request for help!
Background info:
I want to repeat (loop) a command on a df using a variable that contains a list of options. While the series 'amenity_options' contains a simple list of specific items (let's say only four amenities, as in the example below), the df is a large data frame with many other items. My goal is to run the operation below for each item in 'amenity_options' until the end of the list.
amenity_options = ['cafe','bar','cinema','casino'] # this is a series type with multiple options
df = df[df['amenity'] == amenity_options] # this is my attempt to select the first value in the series (e.g. cafe) from a dataframe that contains a column with that name.
df.to_excel('{}_amenity.xlsx'.format('amenity')) # I wish to save the result (e.g. cafe_amenity) as a separate file.
Desired result: I wish to loop steps one and two for every item in the list (e.g. cafe, bar, cinema...), so that I end up with separate Excel files. Any thoughts?
What @Rakesh suggested is correct; you probably just need one more step.
df = df[df['amenity'].isin(amenity_options)]
for key, g in df.groupby('amenity'):
    g.to_excel('{}_amenity.xlsx'.format(key))
After you call groupby() on your df, you get up to 4 groups (one per amenity present) that you can loop over directly.
key is the group key (cafe, bar, etc.) and g is the sub-dataframe filtered by that key.
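To see what groupby() hands back on each iteration without writing any files, here is a small sketch. The sample rows and the name column are invented for illustration.

```python
import pandas as pd

amenity_options = ["cafe", "bar", "cinema", "casino"]

# Hypothetical sample data; the real df has many more rows and columns.
df = pd.DataFrame({
    "amenity": ["cafe", "bar", "school", "cafe", "cinema"],
    "name": ["A", "B", "C", "D", "E"],
})

filtered = df[df["amenity"].isin(amenity_options)]

# Collect each (key, sub-dataframe) pair instead of writing it to Excel.
groups = {key: g for key, g in filtered.groupby("amenity")}

for key, g in groups.items():
    print(key, g["name"].tolist())
```

Only amenities that actually occur in the data produce a group, so in this sample there is no casino entry and no file would be written for it.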
Seems like you just need a simple for loop:
for amenity in amenity_options:
    df[df['amenity'] == amenity].to_excel(f"{amenity}_amenity.xlsx")

Python Dataframe for loop syntax

I'm having trouble in correctly executing a for loop through my dataframe in python.
Basically, for every row in the dataframe (df_weather), the code should select one value each from the column no. 13 and 14 and execute a function which is defined earlier in the code. Eventually, I require the calculated value in each row to be summed to give me one final answer.
The error being returned is as follows: "string indices must be integers"
Request anyone to help me through this step. The code for the same is provided below.
Thanks!
stress_rate = 0
for i in df_weather:
    b = GetStressDampHeatParameterized(i[:,13], i[:,14])
    stress_rate = b + stress_rate
print(stress_rate)
This can be solved in a single line:
print(sum(df.apply(lambda row: func(row[14], row[15]), axis=1)))
Where func is your desired function and axis=1 ensures that the function is applied on each row as opposed to each column (which is the default).
My solution first creates a temporary series (picture: an unattached column) that is constructed by applying a function to each row in turn. The function that is actually being applied is an anonymous function indicated by the keyword lambda, which takes a single input row and which is fed a single row at a time from the apply method. That anonymous function simply calls your function func and passes the two column values in the row.
A Series can be summed using the sum function.
Note the indexing of the columns starts at 0.
Also note, saying for x in df: will iterate over the columns.
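A runnable sketch of the row-wise apply followed by a sum, on a made-up two-column frame. The data, the column names, and the body of func are placeholders for the real df_weather and GetStressDampHeatParameterized. The columns are picked by position with .iloc, so adjust the positions to your own frame.

```python
import pandas as pd

# Hypothetical sample data standing in for df_weather.
df_weather = pd.DataFrame({
    "temp": [20.0, 24.0, 30.0],
    "humidity": [0.5, 0.25, 0.5],
})

def func(temp, humidity):
    # Hypothetical stand-in for GetStressDampHeatParameterized.
    return temp * humidity

# axis=1 passes one row at a time; row.iloc[n] selects the n-th column by position.
per_row = df_weather.apply(lambda row: func(row.iloc[0], row.iloc[1]), axis=1)
stress_rate = per_row.sum()

print(stress_rate)
```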
Your number one problem is the following line:
for i in df_weather: This line actually yields the column titles, not the rows themselves. What you're looking for is the following:
for i in df_weather.values:. Note that values is an attribute, not a method, so it takes no parentheses. It returns a NumPy array that you can iterate over row by row. Each i is then a single row, so index it as i[13] and i[14] rather than i[:,13] and i[:,14].
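For comparison, a sketch of the explicit loop over the underlying NumPy array. The data and func are again invented stand-ins, and the column positions 0 and 1 belong to this toy frame only.

```python
import pandas as pd

# Hypothetical sample data standing in for df_weather.
df_weather = pd.DataFrame({
    "temp": [20.0, 24.0, 30.0],
    "humidity": [0.5, 0.25, 0.5],
})

def func(temp, humidity):
    # Hypothetical stand-in for GetStressDampHeatParameterized.
    return temp * humidity

stress_rate = 0
for row in df_weather.values:  # each row is a 1-D NumPy array
    stress_rate += func(row[0], row[1])

print(stress_rate)
```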

Python Pandas: .apply taking forever?

I have a DataFrame 'clicks' created by parsing CSV of size 1.4G. I'm trying to create a new column 'bought' using apply function.
clicks['bought'] = clicks['session'].apply(getBoughtItemIDs)
In getBoughtItemIDs, I'm checking if 'buys' dataframe has values I want, and if so, return a string concatenating them. The first line in getBoughtItemIDs is taking forever. What are the ways to make it faster?
def getBoughtItemIDs(val):
    boughtSessions = buys[buys['session'] == val].values
    output = ''
    for row in boughtSessions:
        output += str(row[1]) + ","
    return output
There are a couple of things that make this code run slowly.
apply is essentially just syntactic sugar for a for loop over the rows of a column. There's also an explicit for loop over a NumPy array in your function (the for row in boughtSessions part). Looping in this (non-vectorised) way is best avoided whenever possible as it impacts performance heavily.
buys[buys['session'] == val].values is looking up val across an entire column for each row of clicks, then returning a sub-DataFrame and then creating a new NumPy array. Repeatedly looking for values in this way is expensive (O(n) complexity each lookup). Creating new arrays is going to be expensive since memory has to be allocated and the data copied across each time.
If I understand what you're trying to do, you could try the following approach to get your new column.
First use groupby to group the rows of buys by the values in 'session'. apply is used to join up the strings for each value:
boughtSessions = buys.groupby('session')[col_to_join].apply(lambda x: ','.join(x))
where col_to_join is the column from buys which contains the values you want to join together into a string.
groupby means that only one pass through the DataFrame is needed and is pretty well-optimised in Pandas. The use of apply to join the strings is unavoidable here, but only one pass through the grouped values is needed.
boughtSessions is now a Series of strings indexed by the unique values in the 'session' column. This is useful because lookups to Pandas indexes are O(1) in complexity.
To match each string in boughtSessions to the appropriate value in clicks['session'] you can use map. Unlike apply, map is fully vectorised and should be very fast:
clicks['bought'] = clicks['session'].map(boughtSessions)
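Putting the two steps together on toy data, as a sketch: the frames and the item column are invented, and the astype(str) and fillna('') calls are my additions, the first in case the joined column is numeric, the second so sessions with no purchases get an empty string as in the original function.

```python
import pandas as pd

# Hypothetical sample data standing in for the real clicks and buys frames.
clicks = pd.DataFrame({"session": [1, 2, 3, 1]})
buys = pd.DataFrame({"session": [1, 1, 3], "item": [10, 11, 30]})

# One pass over buys: join the item IDs of each session into a single string.
bought_sessions = buys.groupby("session")["item"].apply(
    lambda x: ",".join(x.astype(str))
)

# Look up each click's session in the grouped Series; sessions without
# purchases come back as NaN, which fillna turns into an empty string.
clicks["bought"] = clicks["session"].map(bought_sessions).fillna("")

print(clicks)
```

Unlike the original function, this does not leave a trailing comma after the last ID.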
