I have a huge dataset in which I need to change the value of a column wherever a certain condition holds.
In Pandas I execute the following code to accomplish what I want:
df.loc[
    (df['ID_CRITERIO_APURACAO'] == TipoDestinatario.RESIDENCIAL.value) &
    (df['CODG_GRUPO_TENSAO'] == 8) &
    (df['CONSUMO'].between(0, 30)),
    'DESCONTO'
] = 35
How can I do something similar in Dask?
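Dask doesn't support in-place mutation, so build the condition and assign the column again. Use mask rather than where: where keeps the values where the condition is True and replaces the rest, while mask replaces the values where the condition is True. Try this:
condition = (
    (df['ID_CRITERIO_APURACAO'] == TipoDestinatario.RESIDENCIAL.value) &
    (df['CODG_GRUPO_TENSAO'] == 8) &
    (df['CONSUMO'].between(0, 30))
)
df['DESCONTO'] = df['DESCONTO'].mask(condition, 35)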
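To check which of where and mask matches the pandas .loc assignment, here is a minimal sketch in plain pandas (dask.dataframe exposes Series.where and Series.mask with the same semantics). The small frame and its values are made up for illustration, and the ID_CRITERIO_APURACAO test is left out to keep it short.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "CODG_GRUPO_TENSAO": [8, 8, 3],
    "CONSUMO": [10, 50, 20],
    "DESCONTO": [0, 0, 0],
})
condition = (df["CODG_GRUPO_TENSAO"] == 8) & df["CONSUMO"].between(0, 30)

# mask: replace the value where the condition is True (same effect as .loc[...] = 35)
masked = df["DESCONTO"].mask(condition, 35)

# where: keep the value where the condition is True, replace everywhere else
whered = df["DESCONTO"].where(condition, 35)

print(masked.tolist())
print(whered.tolist())
```

Only the first row satisfies the condition, so mask changes that row alone, while where changes every other row.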
I have a large dataframe (sample). I was filtering the data according to this code:
A = [f"A{i}" for i in range(50)]
B = [f"B{i}" for i in range(50)]
C = [f"C{i}" for i in range(50)]
for i in A:
    cond_A = (df[i]>= -0.0423) & (df[i]<=3)
    filt_df = df[cond_A]
for i in B:
    cond_B = (filt_df[i]>= 15) & (filt_df[i]<=20)
    filt_df2 = filt_df[cond_B]
for i in C:
    cond_C = (filt_df2[i]>= 15) & (filt_df2[i]<=20)
    filt_df3 = filt_df2[cond_B]
When I print filt_df3, I get only an empty dataframe. Why?
How can I improve the code? Are there other approaches, such as more advanced techniques?
I am not sure the code above does what I outline in the edit below.
How can I change the code so that it works as outlined in the edit below?
Edit:
1. I want to remove rows based on columns A0 to A49 using cond_A.
2. Then filter the dataframe from step 1 based on columns B0 to B49 using cond_B.
3. Then filter the dataframe from step 2 based on columns C0 to C49 using cond_C.
Thank you very much in advance.
It seems to me that there is an issue with how your code uses iteration to do the filtering. For example, filt_df is overwritten in every iteration of the first loop, so when the loop ends it contains only the rows that pass the condition from the last iteration. Is this what you intend?
If you want to do the filtering efficiently, you can try pandas.DataFrame.query (see documentation here). For example, to keep only the rows in which every column from B0 to B49 holds a value between 0 and 200 inclusive, you can use the Python code below (assuming the raw data has been loaded into the variable df).
condition_list = [f'B{i} >= 0 & B{i} <= 200' for i in range(50)]
filter_str = ' & '.join(condition_list)
subset_df = df.query(filter_str)
print(subset_df)
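The same idea on a frame small enough to check by hand might look like the sketch below. The three columns and their values are invented for illustration; each condition is wrapped in parentheses and joined with and, which avoids any doubt about operator precedence inside the query string.

```python
import pandas as pd

# Hypothetical data: three "B" columns instead of fifty.
df = pd.DataFrame({
    "B0": [10, 250, 30],
    "B1": [5, 20, 199],
    "B2": [0, 40, 201],
})
cols = ["B0", "B1", "B2"]

# One parenthesised range test per column, all joined with "and".
filter_str = " and ".join(f"({c} >= 0) and ({c} <= 200)" for c in cols)
subset_df = df.query(filter_str)
print(subset_df)
```

Only the first row has all three values inside [0, 200], so it is the only row kept.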
Since the column A1 contains only -0.057, which is outside [-0.0423, 3], everything gets filtered out.
Nevertheless, you do not seem to carry the filter over from one iteration to the next, because filt_df{1|2|3} is reset each time.
This should work:
import pandas as pd
A = [f"A{i}" for i in range(50)]
B = [f"B{i}" for i in range(50)]
C = [f"C{i}" for i in range(50)]
filt_df = df.copy()
for i in A:
    # build each mask from the frame being filtered so the indexes match
    cond_A = (filt_df[i] >= -0.0423) & (filt_df[i] <= 3)
    filt_df = filt_df[cond_A]
filt_df2 = filt_df.copy()
for i in B:
    cond_B = (filt_df2[i] >= 15) & (filt_df2[i] <= 20)
    filt_df2 = filt_df2[cond_B]
filt_df3 = filt_df2.copy()
for i in C:
    cond_C = (filt_df3[i] >= 15) & (filt_df3[i] <= 20)
    # filter with cond_C here, not cond_B
    filt_df3 = filt_df3[cond_C]
print(filt_df3)
Of course, the pandas library offers many filtering tools that can be applied to multiple columns at once.
For example, this one:
https://stackoverflow.com/a/39820329/6139079
You can test all the columns of a group at once and use DataFrame.all to check that every column in a row satisfies the condition:
A = [f"A{i}" for i in range(50)]
cond_A = ((df[A] >= -0.0423) & (df[A]<=3)).all(axis=1)
B = [f"B{i}" for i in range(50)]
cond_B = ((df[B]>= 15) & (df[B]<=20)).all(axis=1)
C = [f"C{i}" for i in range(50)]
cond_C = ((df[C]>= 15) & (df[C]<=20)).all(axis=1)
Finally, chain all the masks with & (bitwise AND):
filt_df = df[cond_A & cond_B & cond_C]
If you get an empty DataFrame, it means that no row satisfies all the conditions.
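A small self-contained sketch of this approach is shown below. The helper name all_between and the sample values are made up for illustration; the bounds are the ones from the question.

```python
import pandas as pd

# Hypothetical data with two "A" and two "B" columns.
df = pd.DataFrame({
    "A0": [0.5, 1.0, -1.0],
    "A1": [2.0, 3.5, 0.0],
    "B0": [16, 17, 18],
    "B1": [15, 20, 19],
})

def all_between(frame, cols, low, high):
    """Row mask: True where every column in cols lies in [low, high]."""
    sub = frame[cols]
    return ((sub >= low) & (sub <= high)).all(axis=1)

cond_A = all_between(df, ["A0", "A1"], -0.0423, 3)
cond_B = all_between(df, ["B0", "B1"], 15, 20)
filt_df = df[cond_A & cond_B]
print(filt_df)
```

Printing cond_A and cond_B separately is a quick way to find out which group of columns is responsible when the final result comes back empty.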
I am trying to drop rows at specific minutes (05, 10, 20).
I have a datetime index:
df5['Year'] = df5.index.year
df5['Month'] = df5.index.month
df5['Day']= df5.index.day
df5['Day_of_Week']= df5.index.day_name()
df5['hour']= df5.index.strftime('%H')
df5['Min']= df5.index.strftime('%M')
df5
Then I run the code below:
def clean(df5):
    for i in range(len(df5)):
        hour = pd.Timestamp(df5.index[i]).hour
        minute = pd.Timestamp(df5.index[i]).minute
        if df5 = df5[(df5.index.minute ==5) | (df5.index.minute == 10)| (df5.index.minute == 20)]
            df.drop(axis=1, index=i, inplace=True)
It returns an invalid syntax error.
Looping is not necessary here, and it is not recommended.
Use DatetimeIndex.minute with Index.isin, and invert the mask with ~ for boolean indexing:
df5 = df5[~df5.index.minute.isin([5, 10, 20])]
To reuse the column df5['Min'], compare against string values:
df5 = df5[~df5['Min'].isin(['05', '10', '20'])]
All together:
def clean(df5):
    return df5[~df5.index.minute.isin([5, 10, 20])]
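To try the function on something small, the sketch below builds a hypothetical frame with one row every five minutes. The column name value and the minutes parameter are additions for illustration, not part of the question.

```python
import pandas as pd

# Hypothetical data: six rows at minutes 0, 5, 10, 15, 20, 25.
idx = pd.date_range("2021-01-01 00:00", periods=6, freq="5min")
df5 = pd.DataFrame({"value": range(6)}, index=idx)

def clean(frame, minutes=(5, 10, 20)):
    """Return the rows whose index minute is not in `minutes`."""
    return frame[~frame.index.minute.isin(list(minutes))]

cleaned = clean(df5)
print(cleaned)
```

The rows at minutes 5, 10 and 20 are dropped, leaving those at minutes 0, 15 and 25.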
You can just do it using boolean indexing, assuming that the index is already parsed as datetime.
df5 = df5[~((df5.index.minute == 5) | (df5.index.minute == 10) | (df5.index.minute == 20))]
Or, equivalently, keep the rows whose minute differs from all three values (note that the negated form needs & rather than |):
df5 = df5[(df5.index.minute != 5) & (df5.index.minute != 10) & (df5.index.minute != 20)]
Generally speaking, the right syntax for combining conditions with a logical OR inside an if statement is the following (the Python keyword is the lowercase or):
today = 'Saturday'
if today == 'Sunday' or today == 'Saturday':
    print('Today is off. Rest at home')
In your case the comparisons are element-wise over the whole index, so combine them with | and keep the result as a mask rather than testing it in an if:
mask = (df5.index.minute == 5) | (df5.index.minute == 10)
......
FINAL NOTE:
You mixed up == and = in a few places.
In Python (and many other programming languages), a single equals sign = assigns a value to a variable, whereas two consecutive equals signs == check whether two expressions have the same value.
= is an assignment operator
== is an equality operator
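The difference between the scalar keyword or and the element-wise operator | can be seen in the short sketch below. The sample index of minutes is made up for illustration.

```python
import pandas as pd

# Scalars: `==` compares, `or` combines two single True/False values.
today = "Saturday"
is_day_off = today == "Sunday" or today == "Saturday"

# Arrays: each `==` yields one boolean per element, combined with `|`.
minutes = pd.Index([0, 5, 10, 15])
mask = (minutes == 5) | (minutes == 10)

# Using `or` on whole arrays fails, because an array has no single truth value.
try:
    (minutes == 5) or (minutes == 10)
    raised = False
except ValueError:
    raised = True

print(is_day_off, mask.tolist(), raised)
```

This is why a condition over a whole index or column belongs inside the square brackets of boolean indexing rather than in an if statement.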
I would like to learn how to use df.loc and a for-loop to calculate new columns for the dataframes below.
Problem: from df_G, for T = 400, take the value of each Go_j as input.
Then add a new column "G_ads_400" to the dataframe df, computed as df['Adsorption_energy_eV'] - Go_h2o
df_G
df
Here is my code for one temperature:
Go_co2 = df_G.loc[df_G.index == "400" & df_G.Go_CO2]
Go_o2= df_G.loc[df_G.index == "400" & df_G.Go_O2]
Go_co= df_G.loc[df_G.index == "400" & df_G.Go_CO]
df.loc[df['Adsorbates'] == "CO2", "G_ads_400"] = df.Adsorption_energy_eV-Go_co2
df.loc[df['Adsorbates'] == "CO", "G_ads_400"] = df.Adsorption_energy_eV-Go_co
df.loc[df['Adsorbates'] == "O2", "G_ads_400"] = df.Adsorption_energy_eV-Go_o2
I am not sure why I keep getting an error, and I would like to know how to put this in a for-loop so that it looks less messy.
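One possible way to write this as a loop is sketched below. Since the two tables are not shown, everything about their layout is an assumption: df_G is taken to be indexed by temperature with columns Go_CO2, Go_O2 and Go_CO, df is taken to have the columns Adsorbates and Adsorption_energy_eV, and all numbers are invented.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables in the question.
df_G = pd.DataFrame(
    {"Go_CO2": [-1.0, -1.5], "Go_O2": [-0.5, -0.75], "Go_CO": [-0.75, -1.25]},
    index=[300, 400],
)
df = pd.DataFrame({
    "Adsorbates": ["CO2", "CO", "O2"],
    "Adsorption_energy_eV": [-2.0, -1.0, -0.5],
})

for T in df_G.index:
    # Row of df_G at temperature T, relabelled from "Go_CO2" to "CO2" and so on
    # so that its labels match the values in df["Adsorbates"].
    go = df_G.loc[T].rename(lambda name: name.replace("Go_", ""))
    # Look up the matching Go value for each adsorbate and subtract it.
    df[f"G_ads_{T}"] = df["Adsorption_energy_eV"] - df["Adsorbates"].map(go)

print(df)
```

Selecting a row with df_G.loc[T] gives all the Go values for that temperature at once, so there is no need for one .loc line per adsorbate.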
I want to impute the missing values for df['box_office_revenue'] with the median specified by df['release_date'] == x and df['genre'] == y .
Here is my median finder function below.
def find_median(df, year, genre, col_year, col_rev):
    median = df[(df[col_year] == year) & (df[col_rev].notnull()) & (df[genre] > 0)][col_rev].median()
    return median
The median function works. I checked. I did the code below since I was getting some CopyValue error.
pd.options.mode.chained_assignment = None # default='warn'
I then go through the years and genres, col_name = ['is_drama', 'is_horror', etc] .
i = df['release_year'].min()
while (i < df['release_year'].max()):
    for genre in col_name:
        median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
        df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(median, inplace=True)
    print(i)
    i += 1
However, nothing changed!
len(df['box_office_revenue'].isnull())
The output was 35527. Meaning none of the null values in df['box_office_revenue'] had been filled.
Where did I go wrong?
Here is a quick look at the data: The other columns are just binary variables
You mentioned
I did the code below since I was getting some CopyValue error...
The warning is important. You did not give your data, so I cannot actually check, but the problem is likely due to:
df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(..)
Let's break this down:
First you select some rows with:
df[(df['release_year'] == i) & (df[genre] > 0)]
Then from that, you select a column with:
...['box_office_revenue']
And now you have a problem...
Why?
The problem is that when you select only some rows (i.e. not all), pandas has to create a copy of your dataframe. You then select a column of that copy and call fillna() on it, so the original dataframe is never touched. Not super useful.
How do I fix it?
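Select the rows and the column in a single .loc call and assign the filled values back:
mask = (df['release_year'] == i) & (df[genre] > 0)
df.loc[mask, 'box_office_revenue'] = df.loc[mask, 'box_office_revenue'].fillna(..)
A single .loc assignment writes directly to the original dataframe, so no intermediate copy gets in the way and the operation works as desired.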
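The effect can be checked on a toy frame, as in the sketch below. The data and the fill value of 10.0 are invented for illustration. Note also that len(series.isnull()) returns the number of rows, not the number of missing values; .isnull().sum() counts the missing ones.

```python
import numpy as np
import pandas as pd

# Hypothetical data using the column names from the question.
df = pd.DataFrame({
    "release_year": [2000, 2000, 2001],
    "is_drama": [1, 1, 0],
    "box_office_revenue": [10.0, np.nan, np.nan],
})
mask = (df["release_year"] == 2000) & (df["is_drama"] > 0)

# Chained indexing: the filled values end up in a separate object, not in df.
filled_copy = df[mask]["box_office_revenue"].fillna(10.0)
nulls_after_chained = int(df["box_office_revenue"].isnull().sum())

# A single .loc assignment writes to df itself.
df.loc[mask & df["box_office_revenue"].isnull(), "box_office_revenue"] = 10.0
nulls_after_loc = int(df["box_office_revenue"].isnull().sum())

print(nulls_after_chained, nulls_after_loc)
```

The count of missing values is unchanged after the chained call and drops by one after the .loc assignment; the remaining missing value belongs to a row outside the mask.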
This is not elegant, but I think it works. Basically, I calculate the means conditioned on genre and year, and then join the data to a dataframe containing the imputing values. Then, wherever the revenue data is null, I replace the null with the imputed value. (Swap mean() for median() if you want medians.)
import pandas as pd
import numpy as np
#Fake Data
rev = np.random.normal(size = 10_000,loc = 20)
rev_ix = np.random.choice(range(rev.size), size = 100 )
rev[rev_ix] = np.nan
year = np.random.choice(range(1950,2018), replace = True, size = 10_000)
genre = np.random.choice(list('abc'), size = 10_000, replace = True)
df = pd.DataFrame({'rev':rev,'year':year,'genre':genre})
imputing_vals = df.groupby(['year','genre']).mean()
s = df.set_index(['year','genre'])
s.rev.isnull().any() #True
#Creates dataframe with new column containing the means
s = s.join(imputing_vals, rsuffix = '_R')
s.loc[s.rev.isnull(),'rev'] = s.loc[s.rev.isnull(),'rev_R']
new_df = s['rev'].reset_index()
new_df.rev.isnull().any() #False
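A shorter variant of the same idea, sketched here as an alternative rather than as what the code above does, uses groupby with transform so that no join is needed. It uses the median, as the question asks; the sample values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same column names as the fake data above.
df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001],
    "genre": ["a", "a", "a", "b", "b"],
    "rev": [10.0, 30.0, np.nan, 5.0, np.nan],
})

# transform returns one value per original row: the median of that row's group.
group_median = df.groupby(["year", "genre"])["rev"].transform("median")

# fillna aligns on the index, so each gap receives its own group's median.
df["rev"] = df["rev"].fillna(group_median)
print(df)
```

A group whose revenues are all missing has no median, so its rows would stay null and need a separate fallback.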
This page describing chained assignment seems useful for such a case: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#evaluation-order-matters
As seen at the URL above:
Hence, instead of doing (in your 'for' loop):
for genre in col_name:
    median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
    df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(median, inplace=True)
You can try:
for genre in col_name:
    median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
    df.loc[(df['release_year'] == i) & (df[genre] > 0) & (df['box_office_revenue'].isnull()), 'box_office_revenue'] = median
What is the best way to convert DataFrame columns into variables? I have a condition for bet placement and I use head(n=1):
back_bf_lay_bq = bb[(bb['bf_back_bq_lay_lose_net'] > 0) & (bb['bq_lay_price'] < 5) & (bb['bq_lay_price'] != 0) & (bb['bf_back_liquid'] > bb['bf_back_stake']) & (bb['bq_lay_liquid'] > bb['bq_lay_horse_win'])].head(n=1)
I would like to convert the columns into variables and pass them to an API for bet placement. So I convert back_bf_lay_bq to a dictionary and extract the values:
#Bets placements
dd = pd.DataFrame.to_dict(back_bf_lay_bq, orient='list')
#Betdaq bet placement
bq_selection_id = dd['bq_selection_id'][0]
bq_lay_stake = dd['bq_lay_stake'][0]
bq_lay_price = dd['bq_lay_price'][0]
bet_type = 2
reset_count = dd['bq_count_reset'][0]
withdrawal_sequence = dd['bq_withdrawal_sequence'][0]
kill_type = 2
betdaq_request = betdaq_api.PlaceOrdersNoReceipt(bq_selection_id,bq_lay_stake,bq_lay_price,bet_type,reset_count,withdrawal_sequence,kill_type)
I do not think this is the most efficient way, and from time to time it raises an error:
bq_selection_id = dd['bq_selection_id'][0]
IndexError: list index out of range
So can you suggest a better way to get values from DataFrame and pass them to API?
IIUC, you could select the subset of columns you need, take the first row with iloc, and unpack it into your variables. Something like this:
bq_selection_id, bq_lay_stake, bq_lay_price, withdrawal_sequence = back_bf_lay_bq[['bq_selection_id', 'bq_lay_stake', 'bq_lay_price', 'bq_withdrawal_sequence']].iloc[0]
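The occasional IndexError most likely appears when no row satisfies the betting condition, so the filtered frame is empty and there is no first row to read; iloc[0] fails in that situation too. A minimal sketch of a guard is shown below. The helper first_row_values and the sample frame are hypothetical, and only a few of the question's column names are used.

```python
import pandas as pd

def first_row_values(frame, columns):
    """Return the first row's values for `columns` as a dict, or None if no rows."""
    records = frame[columns].head(1).to_dict(orient="records")
    return records[0] if records else None

# Hypothetical data standing in for the real `bb` frame.
bb = pd.DataFrame({
    "bq_selection_id": [101, 102],
    "bq_lay_stake": [2.0, 3.0],
    "bq_lay_price": [4.5, 6.0],
})
cols = ["bq_selection_id", "bq_lay_stake", "bq_lay_price"]

match = first_row_values(bb[bb["bq_lay_price"] < 5], cols)     # one row qualifies
no_match = first_row_values(bb[bb["bq_lay_price"] < 1], cols)  # nothing qualifies

print(match, no_match)
```

The caller can then place the bet only when the result is not None and read each value by name, for example match["bq_lay_price"].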