I would like to know how to achieve the following: given a DataFrame, I want to build an array of all the values between -1 and 1. Just the values; I don't care about the day or the index.
Here is the code:
import pandas as pd
import numpy as np
import random
data = [[round(random.uniform(1,100),2) for i in range(7)] for i in range(10)]
header = ['Lunes', 'Martes', 'Miércoles', 'Jueves', 'Viernes', 'Sábado', 'Domingo']
df = pd.DataFrame(data, columns = header)
mean = df.mean()
std = df.std()
df_normalizado = (df-mean)/std
Lunes Martes Miércoles Jueves Viernes Sábado Domingo
0 -0.250799 1.001706 -0.491738 0.444629 -0.296997 -0.670781 -1.554641
1 -0.868792 -0.100689 -0.359056 1.282681 1.352212 1.176829 -1.374482
2 -0.614918 1.187862 1.398010 1.037513 -1.149555 -0.834707 0.143520
3 -0.319758 1.113691 -0.719597 -1.392089 -0.591716 0.943564 -1.163994
4 -0.718137 -1.300041 1.267097 -0.797168 0.053323 1.187264 0.078008
5 -0.883286 -0.821076 -0.671478 1.268079 0.002583 -0.897651 1.096177
6 1.933040 -0.534570 -1.142057 -0.262689 1.417233 0.851335 0.780141
7 -0.433957 -0.575776 1.406855 0.248020 -1.113399 -0.178332 0.497165
8 1.357213 -1.070254 -0.882708 -1.133679 -0.863344 -1.613941 0.491402
9 0.799394 1.099147 0.194671 -0.695298 1.189661 0.036420 1.006704
Thank you, community!
Since just an array is needed, grab the values from the normalized DataFrame and use normal boolean indexing:
a = df_normalizado.values
print(a[(-1 <= a) & (a <= 1)])
Output:
[-0.250799 -0.491738 0.444629 -0.296997 -0.670781 -0.868792 -0.100689
-0.359056 -0.614918 -0.834707 0.14352 -0.319758 -0.719597 -0.591716
0.943564 -0.718137 -0.797168 0.053323 0.078008 -0.883286 -0.821076
-0.671478 0.002583 -0.897651 -0.53457 -0.262689 0.851335 0.780141
-0.433957 -0.575776 0.24802 -0.178332 0.497165 -0.882708 -0.863344
0.491402 0.799394 0.194671 -0.695298 0.03642 ]
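On recent pandas versions, to_numpy() is the recommended way to get the underlying array; the boolean indexing step is unchanged:
a = df_normalizado.to_numpy()
print(a[(-1 <= a) & (a <= 1)])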
Pandas also offers the query function. I hope this helps solve your issue:
df.query("Lunes >= -1 and Lunes <= 1 and
Martes >= -1 and Martes <= 1 and
Miércoles >= -1 and Miércoles <= 1 and
Jueves >= -1 and Jueves <= 1 and
Viernes >= -1 and Viernes <= 1 and
Sábado >= -1 and Sábado <= 1 and
Domingo >= -1 and Domingo <=1")
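If the column list grows, the same filter can be assembled programmatically instead of written out by hand. A minimal sketch, assuming the flat column list from the question (the backticks let query() cope with arbitrary column names):
condition = " and ".join(f"`{col}` >= -1 and `{col}` <= 1" for col in df_normalizado.columns)
print(df_normalizado.query(condition))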
I am trying to generate a new column in a pandas DataFrame by looping over more than 100,000 rows and setting each row's value conditional on an already existing column.
The DataFrame below is a dummy, but it works as an example. My current code is:
df=pd.DataFrame({'IT100':[5,5,-0.001371,0.0002095,-5,0,-5,5,5],
'ET110':[0.008187884,0.008285232,0.00838258,0.008479928,1,1,1,1,1]})
# if charging set to 1, if discharging set to -1.
# if -1 < IT100 < 1 then set CD to previous cells value
# Charging is defined as IT100 > 1 and Discharge is defined as IT100 < -1
def CD(dataFrame):
    for x in range(0, len(dataFrame.index)):
        current = dataFrame.loc[x, "IT100"]
        if x == 0:
            if dataFrame.loc[x+5, "IT100"] > -1:
                dataFrame.loc[x, "CD"] = 1
            else:
                dataFrame.loc[x, "CD"] = -1
        else:
            if current > 1:
                dataFrame.loc[x, "CD"] = 1
            elif current < -1:
                dataFrame.loc[x, "CD"] = -1
            else:
                dataFrame.loc[x, "CD"] = dataFrame.loc[x-1, "CD"]
Using if/else in a loop is extremely slow. I have seen people suggest np.select() or pd.apply(), but I don't know whether those will work for my example. I need to be able to index into the column, because one of my conditions sets the value of the new column to the value of the previous cell in the same column.
Thanks for any help!
@Grajdeanu Alex is right: the loop is slowing you down more than whatever you're doing inside it. With pandas, a loop is usually the slowest choice. Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'IT100':[0,-50,-20,-0.5,-0.25,-0.5,-10,5,0.5]})
df['CD'] = np.nan
#lower saturation
df.loc[df['IT100'] < -1,['CD']] = -1
#upper saturation
df.loc[df['IT100'] > 1,['CD']] = 1
#fill forward
df['CD'] = df['CD'].ffill()
# set the first row from the CD value at index 5 (mirroring the x+5 lookup in the original loop)
df.loc[0,['CD']] = df.loc[5,['CD']]
Using ffill fills each NaN (where -1 < x < 1) with the last valid value above it.
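Since the question mentions np.select(), the two saturation assignments can also be collapsed into one call. A short sketch under the same setup:
import numpy as np
import pandas as pd

df = pd.DataFrame({'IT100': [0, -50, -20, -0.5, -0.25, -0.5, -10, 5, 0.5]})
# -1 where discharging, 1 where charging, NaN elsewhere (to be forward-filled)
df['CD'] = np.select([df['IT100'] < -1, df['IT100'] > 1], [-1, 1], default=np.nan)
df['CD'] = df['CD'].ffill()
df.loc[0, ['CD']] = df.loc[5, ['CD']]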
Similar to EMiller's answer, you could also use clip.
import pandas as pd
import numpy as np
df = pd.DataFrame({'IT100':[0,-50,-20,-0.5,-0.25,-0.5,-10,5,0.5]})
df['CD'] = df['IT100'].clip(-1, 1)
df.loc[~df['CD'].isin([-1, 1]), 'CD'] = np.nan
df['CD'] = df['CD'].ffill()
df.loc[0,['CD']] = df.loc[5,['CD']]
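The idea is the same as above: clip pins every value outside [-1, 1] to the nearest bound, the isin mask then blanks out everything that stayed strictly inside the band, and ffill carries the last saturated value forward.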
As an alternative to @EMiller's answer:
In [213]: df = pd.DataFrame({'IT100':[0,-50,-20,-0.5,-0.25,-0.5,-10,5,0.5]})
In [214]: df
Out[214]:
IT100
0 0.00
1 -50.00
2 -20.00
3 -0.50
4 -0.25
5 -0.50
6 -10.00
7 5.00
8 0.50
In [215]: df['CD'] = pd.Series(np.where(df['IT100'].between(-1, 1), np.nan, df['IT100'].clip(-1, 1))).ffill()
In [217]: df.loc[0, 'CD'] = 1 if df.loc[5, 'IT100'] > -1 else -1
In [218]: df
Out[218]:
IT100 CD
0 0.00 1.0
1 -50.00 -1.0
2 -20.00 -1.0
3 -0.50 -1.0
4 -0.25 -1.0
5 -0.50 -1.0
6 -10.00 -1.0
7 5.00 1.0
8 0.50 1.0
I am definitely still learning Python and have tried countless approaches, but can't figure this one out.
I have a DataFrame with 2 columns, call them A and B. I need to return a df that sums the row values of each of these two columns independently until the running sum of A meets some threshold, for this example let's say 10. So far I am trying to use iterrows() and can segment based on whether A >= 10, but can't work out how to sum rows until the threshold is met. The resulting df must be exhaustive even if the final A values do not meet the threshold; see the final row of the desired output.
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df1
A B
0 20 16
1 10 5
2 3 2
3 1 1
4 12 10
5 9 7
6 6 6
7 5 2
Desired result:
A B
0 20 16
1 10 5
2 16 13
3 15 13
4 5 2
Thank you in advance, much time spent, and assistance is much appreciated!!!
Cheers
I rarely write long loops for pandas, but I didn't see a way to do this with a pandas method. Try this horrible loop :( :
The variable I created, t, checks the cumulative sum to see whether it has reached n (which we have set to 10). Then we decide whether to use t, the cumulative sum, or i, the value in the dataframe, for any given row (j and u just do the same thing in parallel for column B).
There are a few conditions, hence some elif statements, and the last row behaves differently the way I have set it up, so I needed separate logic for it in the final if; otherwise the last value wasn't getting appended:
import pandas as pd
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df1
a,b = [],[]
t,u,count = 0,0,0
n=10
for (i, j) in zip(df1['A'], df1['B']):
    count += 1
    if i < n and t >= n:
        a.append(t)
        b.append(u)
        t = i
        u = j
    elif 0 < t < n:
        t += i
        u += j
    elif i < n and t == 0:
        t += i
        u += j
    else:
        t = 0
        u = 0
        a.append(i)
        b.append(j)
    if count == len(df1['A']):
        if t == i or t == 0:
            a.append(i)
            b.append(j)
        elif t > 0 and t != i:
            t += i
            u += j
            a.append(t)
            b.append(u)
df2 = pd.DataFrame({'A' : a, 'B' : b})
df2
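That said, most of the work can be handed back to pandas by using the loop only to assign a group id per row and letting groupby do the summing. A sketch that reproduces the desired output for the sample data, assuming the same threshold of 10:
import pandas as pd

df1 = pd.DataFrame(data=[[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]], columns=['A','B'])

# Walk column A once, starting a new group whenever the running sum reaches 10.
group_ids, gid, running = [], 0, 0
for a in df1['A']:
    group_ids.append(gid)
    running += a
    if running >= 10:
        gid += 1
        running = 0

# Sum both columns within each group; an incomplete tail group survives as-is.
df2 = df1.groupby(group_ids)[['A', 'B']].sum().reset_index(drop=True)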
Here's a shorter one that works:
import pandas as pd
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df2 = pd.DataFrame()
index = 0
while index < df1.size / 2:
    if df1.iloc[index]['A'] >= 10:
        a = df1.iloc[index]['A']
        b = df1.iloc[index]['B']
        temp_df = pd.DataFrame(data=[[a, b]], columns=['A', 'B'])
        df2 = pd.concat([df2, temp_df], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
        index += 1
    else:
        a_sum = 0
        b_sum = 0
        while a_sum < 10 and index < df1.size / 2:
            a_sum += df1.iloc[index]['A']
            b_sum += df1.iloc[index]['B']
            index += 1
        if a_sum >= 10:
            temp_df = pd.DataFrame(data=[[a_sum, b_sum]], columns=['A', 'B'])
            df2 = pd.concat([df2, temp_df], ignore_index=True)
        else:
            a = df1.iloc[index-1]['A']
            b = df1.iloc[index-1]['B']
            temp_df = pd.DataFrame(data=[[a, b]], columns=['A', 'B'])
            df2 = pd.concat([df2, temp_df], ignore_index=True)
The key is to keep track of where you are in the DataFrame and track the sums. Don't be afraid to use variables.
In pandas, use iloc to access each row by index. Make sure you don't run off the end of the DataFrame by checking its size. df.size returns the number of elements (rows times columns), which is why I divide the size by the number of columns to get the actual number of rows.
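As an aside, len(df1) or df1.shape[0] return the row count directly, which avoids the division by the number of columns.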
I'm trying to compare the value at the current index to the value at the next index in my pandas DataFrame. I'm able to access the value with iloc, but when I write an if condition to check the value, it gives me an error.
Code I tried:
df = pd.DataFrame({'Col1': [2.5, 1.5, 3 , 3 ,4.8 , 4 ]})
trend = list()
for k in range(len(df)):
    if df.iloc[k+1] > df.iloc[k]:
        trend.append('up')
    if df.iloc[k+1] < df.iloc[k]:
        trend.append('down')
    if df.iloc[k+1] == df.iloc[k]:
        trend.append('nochange')
dftrend = pd.DataFrame(trend)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I tried assigning the iloc[k] value to a variable "current" and casting it with astype(int), but I'm still unable to use "current" in my if condition. I'd appreciate it if somebody could explain how to resolve this.
You are getting the error because df.iloc[k] gives you a pd.Series, not a scalar.
You can use df.iloc[k, 0] to get the Col1 value itself.
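For example, a sketch of the same loop with scalar access (also stopping one row early so that k+1 stays in bounds):
import pandas as pd

df = pd.DataFrame({'Col1': [2.5, 1.5, 3, 3, 4.8, 4]})
trend = []
for k in range(len(df) - 1):  # stop at len(df)-1 so k+1 is a valid position
    if df.iloc[k + 1, 0] > df.iloc[k, 0]:
        trend.append('up')
    elif df.iloc[k + 1, 0] < df.iloc[k, 0]:
        trend.append('down')
    else:
        trend.append('nochange')
print(trend)  # ['down', 'up', 'nochange', 'up', 'down']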
What I have done is convert that particular column into a list. Rather than working directly with the Series object returned by the DataFrame, I prefer converting it to a list or NumPy array first and then performing basic operations on it.
The corrected working code is given below.
import pandas as pd
df = pd.DataFrame({'Col1': [2.5, 1.5, 3 , 3 ,4.8 , 4 ]})
trend = list()
temp=df['Col1'].tolist()
print(temp)
for k in range(len(df)-1):
    if temp[k+1] > temp[k]:
        trend.append('up')
    if temp[k+1] < temp[k]:
        trend.append('down')
    if temp[k+1] == temp[k]:
        trend.append('nochange')
dftrend = pd.DataFrame(trend)
print(trend)
Here is a more pandas-like approach. We can get the difference of two consecutive elements of a series easily via pandas.DataFrame.diff:
import pandas as pd
df = pd.DataFrame({'Col1': [2.5, 1.5, 3 , 3 ,4.8 , 4 ]})
df_diff = df.diff()
Col1
0 NaN
1 -1.0
2 1.5
3 0.0
4 1.8
5 -0.8
Now you can apply a function elementwise that only distinguishes >0, <0 or ==0, using pandas.DataFrame.applymap:
def direction(x):
    if x > 0:
        return 'up'
    elif x < 0:
        return 'down'
    elif x == 0:
        return 'nochange'

df_diff.applymap(direction)
Col1
0 None
1 down
2 up
3 nochange
4 up
5 down
Finally, it's a design decision what should happen to the first value of the series; the NaN there doesn't fit any case. You can treat it separately in direction, or omit it from your result by slicing.
Edit: the same as a one-liner:
df.diff().applymap(lambda x: 'up' if x > 0 else ('down' if x < 0 else ('nochange' if x == 0 else None)))
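Note that on pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map, so the same one-liner reads df.diff().map(...).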
You can use this:
df['trend'] = np.where(df.Col1.shift().isnull(), "N/A",
              np.where(df.Col1 == df.Col1.shift(), "nochange",
              np.where(df.Col1 < df.Col1.shift(), "down", "up")))
Col1 trend
0 2.5 N/A
1 1.5 down
2 3.0 up
3 3.0 nochange
4 4.8 up
5 4.0 down
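The nesting gets deep quickly with np.where. The same logic reads flatter with np.select, which picks the first condition that matches; a sketch with the sample data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [2.5, 1.5, 3, 3, 4.8, 4]})
prev = df.Col1.shift()
df['trend'] = np.select(
    [prev.isnull(), df.Col1 > prev, df.Col1 < prev],
    ['N/A', 'up', 'down'],
    default='nochange',
)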
I need to sum column "Runs" when MatchN is x and B is between i and j.
MatchN I B Runs
1000887 1 0.1 1
1000887 1 0.2 3
1000887 1 0.3 0
1000887 1 0.4 2
1000887 1 0.5 1
I tried using a for loop but haven't been able to crack it so far. Any suggestions?
You can first apply a filter, and then sum up the Runs column of the filtered rows, like:
df[(df['MatchN'] == x) & (i <= df['B']) & (df['B'] <= j)]['Runs'].sum()
# \_________________________ _________________________/ \___ __/\_ __/
# v v v
# filter part column sum part
So the filter part is the logical AND of three conditions:
df['MatchN'] == x;
i <= df['B']; and
df['B'] <= j.
We use the & operator to combine the three filters. Next we select these rows with df[<filter-condition>] (with <filter-condition> our previously discussed filter).
Next we select the Runs column of the filtered rows, and then finally we calculate the .sum() of that column.
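For instance, with the sample rows above and illustrative values for x, i and j:
import pandas as pd

df = pd.DataFrame({'MatchN': [1000887] * 5, 'I': [1] * 5,
                   'B': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'Runs': [1, 3, 0, 2, 1]})
x, i, j = 1000887, 0.2, 0.4

print(df[(df['MatchN'] == x) & (i <= df['B']) & (df['B'] <= j)]['Runs'].sum())  # 3 + 0 + 2 = 5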
You can use query:
x = 1000887
i = 0.2
j = 0.4
df.query('MatchN == #x and #i <= B <= #j')['Runs'].sum()
Output:
5