I want to check if one date is between two other dates (everything in the same row). If it is, I want a new column to be filled with the sales value from the same row; if not, the row should be dropped.
The code should iterate over the entire DataFrame.
This is my code:
for row in final:
    x = 0
    if pd.to_datetime(final['start_date'].iloc[x]) < pd.to_datetime(final['purchase_date'].iloc[x]) < pd.to_datetime(final['end_date'].iloc[x]):
        final['new_col'].iloc[x] = final['sales'].iloc[x]
    else:
        final.drop(final.iloc[x])
    x = x + 1
print(final['new_col'])
Instead of the values of final['sales'], I just get 0 back.
Does anyone know where the mistake is, or a more efficient way to tackle this?
I would do something like this:
First, creating the new column:
import numpy as np
final['new_col'] = np.where(
    (pd.to_datetime(final['start_date']) < pd.to_datetime(final['purchase_date'])) &
    (pd.to_datetime(final['purchase_date']) < pd.to_datetime(final['end_date'])),
    final['sales'],
    np.nan
)
Then, you just drop the Na's:
final.dropna(inplace=True)
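A minimal self-contained demo of the same two steps, with made-up sample data (column names taken from the question):

import pandas as pd
import numpy as np

final = pd.DataFrame({
    'start_date':    ['2020-01-01', '2020-01-01'],
    'purchase_date': ['2020-01-15', '2020-02-15'],
    'end_date':      ['2020-02-01', '2020-02-01'],
    'sales':         [100, 200],
})

# Purchase date strictly between start and end -> keep the sales value, else NaN.
between = (pd.to_datetime(final['start_date']) < pd.to_datetime(final['purchase_date'])) & \
          (pd.to_datetime(final['purchase_date']) < pd.to_datetime(final['end_date']))
final['new_col'] = np.where(between, final['sales'], np.nan)

final = final.dropna()  # only the first row survives in this example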
I have a dataframe that looks as follows (click on the link below):
df.head(10)
https://ibb.co/vqmrkXb
What I would like to do is remove outliers from the target column (occupied_parking_spaces) when the value of the day column is equal to 6, for instance, which refers to Sunday (df['day'] == 6), using the normal distribution 68-95-99.7 rule.
I tried the following code:
df = df.mask((df['occupied_parking_spaces'] - df['occupied_parking_spaces'].mean()).abs() > 2 * df['occupied_parking_spaces'].std()).dropna()
This line of code removes outliers from the whole dataset no matter the independent variables, but I only want to remove outliers from the occupied_parking_spaces column where the day value is equal to 6, for example.
What I can do is to create a different dataframe for which I will remove outliers:
sunday_df = df.loc[df['day'] == 0]
sunday_df = sunday_df.mask((sunday_df['occupied_parking_spaces'] - sunday_df['occupied_parking_spaces'].mean()).abs() > 2 * sunday_df['occupied_parking_spaces'].std()).dropna()
But by doing this I will get a separate dataframe for every day of the week that I will have to concatenate at the end, and this is something I do not want to do, as there must be a way to do this inside the same dataframe.
Could you please help me out?
Having defined some function to remove outliers, you could use np.where to apply it selectively:
import numpy as np
df['occupied_parking_spaces'] = np.where(df['day'] == 0,
                                         remove_outliers(df['occupied_parking_spaces']),
                                         df['occupied_parking_spaces'])
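One possible remove_outliers helper, following the 2-standard-deviation rule from the question (the answer leaves this function undefined, so this is only a sketch; note it computes the mean and standard deviation over the whole Series it is given, not just the day == 0 rows):

import pandas as pd

def remove_outliers(s, n_std=2):
    # Mask values more than n_std standard deviations from the mean with NaN.
    return s.mask((s - s.mean()).abs() > n_std * s.std())

Rows that end up as NaN can then be dropped with df.dropna(subset=['occupied_parking_spaces']).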
Using Python 3, I am working with a DataFrame with 2 columns: the first column contains dates, for example from 2017-12-18 to 2017-12-25 (but it can change based on the user input), and the second column contains the corresponding data.
What I need is to add a third column in the first position of the DataFrame as such:
1;2017-12-18;Data1
2;2017-12-19;Data2
n;last_date;DataN
import numpy as np
def func(df):
    count = np.count_nonzero(df['Data_column'])
    l = np.array([x for x in range(count)])
    df['day_count'][0] = l
    return l
But it keeps giving me wrong results. What am I doing wrong?
Thanks in advance!
P.S. I used count_nonzero because sometimes I could have N/A or blank values in the time series.
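For reference, one straightforward way to get such a 1-to-n counter as the first column (just a sketch; 'day_count' is a hypothetical name for the new column and df is the DataFrame from the question):

# Insert a 1..n row counter as the first column.
df.insert(0, 'day_count', list(range(1, len(df) + 1)))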
I have a dataframe df where one column is timestamp and one is A. Column A contains decimals.
I would like to add a new column B and fill it with the current value of A divided by the value of A one minute earlier. That is:
df['B'] = df['A'] (current) / df['A'] (current - 1 min)
NOTE: The data does not come in exactly every 1 minute so "the row one minute earlier" means the row whose timestamp is the closest to (current - 1 minute).
Here is how I do it:
First, I use the timestamp as the index in order to use get_loc, and I create a new dataframe new_df starting from 1 minute after df. This way I'm sure the data I look up 1 minute earlier always exists, even for the first rows of new_df.
new_df = df.loc[df['timestamp'] > df.timestamp[0] + delta] # delta = 1 min timedelta
values = []
for index, row in new_df.iterrows():
    v = row.A / df.iloc[df.index.get_loc(row.timestamp - delta, method='nearest')]['A']
    values.append(v)
v_ser = pd.Series(values)
new_df['B'] = v_ser.values
I'm afraid this is not that great. It takes a long time for large dataframes. Also, I am not 100% sure the above is completely correct. Sometimes I get this message:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
What is the best / most efficient way to do the task above? Thank you.
PS. If someone can think of a better title please let me know. It took me longer to write the title than the post and I still don't like it.
You could try to use .asof() if the DataFrame has been indexed correctly by the timestamps (if not, use .set_index() first).
Simple example here
import pandas as pd
import numpy as np
n_vals = 50
# Create a DataFrame with random values and 'unusual times'
df = pd.DataFrame(data=np.random.randint(low=1, high=6, size=n_vals),
                  index=pd.date_range(start=pd.Timestamp.now(),
                                      freq='23s', periods=n_vals),
                  columns=['value'])
# Demonstrate how to use .asof() to get the value that was the 'state' at
# the time 1 min before each index entry. Note the .values call.
df['value_one_min_ago'] = df['value'].asof(df.index - pd.Timedelta('1m')).values
# Note that there will be some NaNs to deal with; consider .fillna()
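Applied to the question's DataFrame, the same idea might look roughly like this (a sketch assuming columns named 'timestamp' and 'A' as described in the question; note that .asof() returns the last value at or before the requested time, not the nearest one):

# Index by timestamp so .asof() can look up the value one minute earlier.
df = df.set_index('timestamp').sort_index()
df['B'] = df['A'] / df['A'].asof(df.index - pd.Timedelta('1m')).values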
I'm trying to code the following logic in pandas: for the first three rows of every group I want to create a variable which should have the value 1 (1st row), 2 (2nd row), 3 (3rd row). I'm doing it like below. In the code below I'm not creating a new variable because I don't know how to do that, so I'm replacing a variable that's already present in the data set. Though my code doesn't throw an error, it's giving me very strange results.
def func(i):
    data.loc[data.groupby('ID').nth(i).index, 'date'] = i
func(1)
Any suggestions?
Thanks in Advance.
If you don't have a duplicated index, you can create a row id within each group, filter out ids larger than 3, and then assign the result back to the data frame:
data['date'] = (data.groupby('ID').cumcount() + 1)[lambda x: x <= 3]
This gives the first three rows of each ID the values 1, 2, 3; rows beyond the third will have NaN values.
data = pd.DataFrame({"ID":[1,1,1,1,2,2,3,3,3]})
data['date'] = (data.groupby('ID').cumcount() + 1)[lambda x: x <= 3]
data
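With that sample frame, the assignment lines up on the index, so the result looks like this (rows past the third in a group get NaN, which you can drop or fill as needed):

   ID  date
0   1   1.0
1   1   2.0
2   1   3.0
3   1   NaN
4   2   1.0
5   2   2.0
6   3   1.0
7   3   2.0
8   3   3.0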
Background
I deal with a csv datasheet that prints out columns of numbers. I am working on a program that will take the first column, ask a user for a time as a float (i.e. 45 and a half hours = 45.5), and then subtract that number from the first column. I have been successful in that regard. Now I need to find the row index of the "zero" time point. I use min to find that index and then use it to look up the value in the following column, A1.1. I need to find the reading at time 0 and then normalize the column to it, so that on a graph the reading at the 0 time point is 1 in column A1.1 (and eventually all subsequent columns, but baby steps for me).
time_zero = float(input("Which time would you like to be set to 0?"))
df['A1']= df['A1']-time_zero
This works fine so far to set the zero time.
zero_location_series = df[df['A1'] == df['A1'].min()]
r1 = zero_location_series[' A1.1']
df[' A1.1'] = df[' A1.1']/r1
Here's where I run into trouble. The first line correctly identifies a series that I can pull from for all my other columns. Next, r1 correctly identifies the proper A1.1 value, and this value is a float when I check type(r1).
However, when I divide df[' A1.1']/r1 it yields only one correct value, and that value is where r1/r1 = 1. All other values come out NaN.
My Questions:
How do I divide a column by a float? Why am I getting NaN?
Is there a faster way to do this, as I need to do it for 16 columns (i.e. 'A2/r2', 'A3/r3', etc.)?
Do I need inplace=True anywhere to make the operations stick before resaving the data? Or is that only for adding/deleting rows?
Example
A DataFrame that looks like this:
http://i.imgur.com/ObUzY7p.png
The zero time sets properly (image not shown).
After dividing the column:
http://i.imgur.com/TpLUiyE.png
This should work:
df['A1.1']=df['A1.1']/df['A1.1'].min()
I think the reason df[' A1.1'] = df[' A1.1']/r1 did not work was because r1 is a series. Try r1? instead of type(r1) and pandas will tell you that r1 is a series, not an individual float number.
To do it in one pass, you can iterate over each column, like this:
for c in df:
    df[c] = df[c] / df[c].min()
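If you do want to keep the r1 approach from the question, pulling out a scalar instead of a one-row Series also avoids the NaNs caused by index alignment. A sketch, using the column names (including the leading space) from the question:

zero_location_series = df[df['A1'] == df['A1'].min()]
r1 = zero_location_series[' A1.1'].iloc[0]   # a scalar, not a one-row Series
df[' A1.1'] = df[' A1.1'] / r1               # broadcasts the scalar to every row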
If you want to divide every value in the column by r1, one option is to use apply, for example:
import pandas as pd
df = pd.DataFrame([1,2,3,4,5])
# apply an anonymous function to the first column ([0]), divide every value
# in the column by 3
df = df[0].apply(lambda x: x / 3.0)
print(df)
So you'd probably want something like this:
df = df["A1.1"].apply(lambda x: x/r1, 0)
This really only answers part 2 of your question. Apply is probably your best bet for running a function on multiple rows and columns quickly. As for why you're getting NaNs when dividing by a float: is it possible the values in your columns are something other than floats or integers?
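If the end goal is to normalize all 16 columns at once, a plain vectorized division is usually simpler and faster than an element-wise apply. A sketch, assuming cols lists the reading columns (the names below are placeholders) and reusing the zero-row lookup from the question:

cols = [' A1.1', ' A2.1']             # placeholder list of the 16 reading columns
zero_row = df.loc[df['A1'].idxmin()]  # the row where the shifted time is at its minimum
df[cols] = df[cols] / zero_row[cols]  # divides each column by its own time-zero reading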