Merging of values in the columns of a dataframe - python

I have a dataframe
import yfinance as yf
import numpy as np
from scipy.signal import argrelextrema
import matplotlib.pyplot as plt
import pandas as pd
company_name = "INFY.NS"
df = yf.Ticker(company_name).history(period='400d', interval='1d')
Now I have this dataframe as df, and I am doing calculations to get the local max and min values.
n = 2
# mark local minima and maxima of the Close price (comparing n points on each side)
df['min'] = df.iloc[argrelextrema(df['Close'].values, np.less_equal, order=n)[0]]['Close']
df['max'] = df.iloc[argrelextrema(df['Close'].values, np.greater_equal, order=n)[0]]['Close']
print(df)
Thus the dataframe looks like this.
But instead of these two columns, max and min, I want only one column named MACD that holds the values of both:
if max is NaN and min has a value, add it to the MACD column, and vice versa;
if max and min are both NaN, drop the row.
What is the best way to do this?

I have got the answer for merging the columns and removing the NaN rows; posting the code here.
df['total'] = df['min'].combine_first(df['max'])  # take min where present, else max
df = df.dropna(subset=['total'])                  # drop rows where both were NaN
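A minimal variant of the same idea, if the merged column should be called MACD as in the question (this just renames the target column and tidies up the two helpers):
df['MACD'] = df['min'].combine_first(df['max'])
df = df.dropna(subset=['MACD']).drop(columns=['min', 'max'])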

Related

Plotting a varying number of columns from Excel

I have an Excel file with several columns.
From these columns I want to plot the ones with names like this:
IVOF_1_H, IVOF_1_L, IVOF_2_H, IVOF_2_L, ... Those columns will be on the y axis. For the x axis the column will always be the same.
I do not know how many of those columns I have in the file; I only know that the number is increasing. Is there any possibility to check how many of those IVOF columns I have and plot them?
In general, there is a limit on the number of those IVOF columns, and I don't mind setting up my script so that all of those columns get plotted (if they exist), but then I don't know how to keep the code from crashing if one of those columns is missing.
You can filter your data frame by its column names:
import pandas as pd
df = pd.read_excel('sample.xlsx')
df = df.filter(regex="IVOF.*")  # keep only the columns whose names match the pattern
# plot the first row
df.iloc[0].plot(kind="bar")
# plot all rows
df.plot(kind="bar")
A simple example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[2, 4, 4], [4, 3, 3], [5, 9, 1]]), columns=['A', 'B1', 'B2'])
df = df.filter(regex="B.*")  # keeps B1 and B2, drops A
df.plot(kind="bar")
The result:
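One more defensive variant, not from the original answer: anchoring the regex prevents unrelated columns from slipping into the plot, and plotting against the shared x column keeps the axis consistent. The column names 'x' and 'other' here are hypothetical stand-ins for the real sheet:
import pandas as pd
import numpy as np

# hypothetical data standing in for the Excel file
df = pd.DataFrame({
    'x': np.arange(5),
    'IVOF_1_H': np.random.rand(5),
    'IVOF_1_L': np.random.rand(5),
    'IVOF_2_H': np.random.rand(5),
    'other': np.random.rand(5),
})

# anchor the pattern so only IVOF_<n>_H / IVOF_<n>_L columns match,
# however many of them the file happens to contain
y_cols = df.filter(regex=r"^IVOF_\d+_[HL]$").columns
df.plot(x='x', y=list(y_cols))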

Counting NaN values in a dataframe

I'm trying to analyse wind speed data from a lidar, creating a dataframe in which the columns are the investigated heights and the row holds the number of NaNs at each elevation. My script creates the dataframe and names the columns as required, but it doesn't write the number of NaNs into the corresponding cells. Any idea what the problem might be?
df = pd.read_csv(fileApath, delimiter=',', skiprows=1)
heights = ['123', '98', '68', '65', '57', '48', '39', '38', '29', '18', '10']
nanvalues_speed = pd.DataFrame()
for i in heights:
    nanvalues_speed[i + 'm'] = pd.notnull(df['Horizontal Wind Speed (m/s) at ' + i + 'm']).sum()
The function you are looking for is pandas.DataFrame.isna()
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2, 3, np.nan, 5],
                   'b': ['a', np.nan, 'c', 'd', 'e']})
df.isna().sum()
The pandas.DataFrame.isna() function returns a boolean same-sized object indicating whether the values in the DataFrame are NA.
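Applied to the original question, a minimal sketch (with a small hypothetical stand-in for the lidar file) that also fixes the second problem in the posted loop: assigning a scalar column-by-column to an empty DataFrame leaves every column empty, because there are no rows to broadcast into.
import pandas as pd
import numpy as np

# hypothetical stand-in for the lidar file, with just two heights
df = pd.DataFrame({
    'Horizontal Wind Speed (m/s) at 98m': [1.0, np.nan, 2.0, np.nan],
    'Horizontal Wind Speed (m/s) at 10m': [np.nan, 3.0, 4.0, 5.0],
})

heights = ['98', '10']
# isna() counts the missing values; the original notnull() counted the present ones
counts = {h + 'm': df['Horizontal Wind Speed (m/s) at ' + h + 'm'].isna().sum()
          for h in heights}

# build the one-row frame in a single step instead of column-by-column
nanvalues_speed = pd.DataFrame([counts])
print(nanvalues_speed)  # 98m -> 2, 10m -> 1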

Find all min and max values in Column Python Panda Dataframe and save it in a new Dataframe

I want to find all the local min and max values in a column and save the whole row in a new dataframe.
See the example code below. I know we have groupby and the like.
How do I do this properly and create the cycle counter, which should increase by 1 each cycle? Lastly, I only want to take the time of the minimum and save it.
import pandas as pd
import numpy as np
l = list(np.linspace(0, 10, 12))
data = [('time', l),
        ('A', [0, 5, 0.6, -4.8, -0.3, 4.9, 0.2, -4.7, 0.5, 5, 0.1, -4.6]),
        ('B', [0, 300, 20, -280, -25, 290, 30, -270, 40, 300, -10, -260]),
        ]
df = pd.DataFrame.from_dict(dict(data))
print(df)
data_1 = [('cycle', [1, 2, 3]),
          ('delta_time', [2.727273, 6.363636, 10.000000]),
          ('A_max', [5, 4.9, 5]),
          ('A_min', [-4.8, -4.7, -4.6]),
          ('B_min', [-280, -270, -260]),
          ('B_max', [300, 290, 300]),
          ]
df_1 = pd.DataFrame.from_dict(dict(data_1))
print(df_1)
Any help is much appreciated.
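No answer was posted in the thread; here is a minimal sketch of one way to produce the desired df_1, assuming, as in the sample data, that each cycle spans exactly four consecutive rows (detecting cycle boundaries from the signal itself, e.g. with scipy.signal.argrelextrema, is left out). delta_time is taken as the time stamp of the minimum of A within each cycle, per the question.
import pandas as pd
import numpy as np

l = list(np.linspace(0, 10, 12))
df = pd.DataFrame({
    'time': l,
    'A': [0, 5, 0.6, -4.8, -0.3, 4.9, 0.2, -4.7, 0.5, 5, 0.1, -4.6],
    'B': [0, 300, 20, -280, -25, 290, 30, -270, 40, 300, -10, -260],
})

# assumption: one cycle = four consecutive samples, as in the example data
cycle = np.arange(len(df)) // 4 + 1

out = df.groupby(cycle).agg(
    A_max=('A', 'max'), A_min=('A', 'min'),
    B_min=('B', 'min'), B_max=('B', 'max'),
)
# the time stamp of the minimum of A within each cycle
out['delta_time'] = df.groupby(cycle).apply(lambda g: g.loc[g['A'].idxmin(), 'time'])
out = out.rename_axis('cycle').reset_index()
print(out)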

Pythonic way of calculating difference between nth and n-1th value in a large dataframe using Pandas?

Let's say I have a 100x100 pandas dataframe, consisting entirely of numerical values.
What I want to do is get the difference in each column between the nth row and the (n-1)th row:
Let's say the first column has values (1, 2, 3, 4, ..., 100); what I would want as output is (1, 1, 1, ..., 1): it subtracts the first row from the second row, the second row from the third, etc., for each column.
I have done it using a for-loop that loops through each column, then each row. But I'm wondering if there's a more elegant solution.
This is what I figure will work; I haven't actually had a chance to try it yet, for reasons...
# 99 difference rows: row x-1 holds row x minus row x-1
outputframe = pd.DataFrame(data=0, index=list(range(99)), columns=list(range(100)))
for i in range(0, 100):
    for x in range(1, 100):
        outputframe.iloc[x - 1, i] = df.iloc[x, i] - df.iloc[x - 1, i]
I believe this will give me the correct results, but I'm wondering if there's possibly a more elegant solution.
The key here is the pandas shift(n) method, which shifts the index by n rows.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 100)))
df_new = df.shift(-1) - df  # next row minus current row; the last row becomes NaN
As @ALollz says, .diff() will work fine and fast here.
The first row will get NaNs, so I'm reassigning the first row again.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 100)))
df_new = df.diff()           # current row minus previous row
df_new.iloc[0] = df.iloc[0]  # restore the first row in place of the NaNs
Original dataframe
After .diff() (NaN on first row)
After df_new.iloc[0] = df.iloc[0]
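A quick check of the .diff() approach against the example from the question: a column counting 1 to 100 should yield all 1s after the first row.
import pandas as pd

df = pd.DataFrame({'col': range(1, 101)})
df_new = df.diff()
df_new.iloc[0] = df.iloc[0]

# every difference after the first row is 1, matching the question's example
assert (df_new['col'].iloc[1:] == 1).all()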

Combining pandas dataframe values based on other column values

I have a pandas dataframe like so:
import pandas as pd
import numpy as np
df = pd.DataFrame([['WY', 'M', 2014, 'Seth', 5],
                   ['WY', 'M', 2014, 'Spencer', 5],
                   ['WY', 'M', 2014, 'Tyce', 5],
                   ['NY', 'M', 2014, 'Seth', 25],
                   ['MA', 'M', 2014, 'Spencer', 23]],
                  columns=['state', 'sex', 'year', 'name', 'number'])
print(df)
How do I manipulate the data to get a dataframe like:
df1 = pd.DataFrame([['M', 2014, 'Seth', 30],
                    ['M', 2014, 'Spencer', 28],
                    ['M', 2014, 'Tyce', 5]],
                   columns=['sex', 'year', 'name', 'number'])
print(df1)
This is just part of a very large dataframe, how would I do this for every name for every year?
df[['sex','year','name','number']].groupby(['sex','year','name']).sum().reset_index()
For a brief description of what this does, from left to right:
Select only the columns we care about. We could replace this part with df.drop('state',axis=1)
Perform a groupby on the columns we care about.
Sum the remaining columns (in this case, just number).
Reset the index so that the columns ['sex','year','name'] are no longer a part of the index.
You can use a pivot table, with the grouping keys as the index so the result flattens back to columns:
df.pivot_table(values='number', index=['sex','year','name'], aggfunc='sum').reset_index()
Group by the columns you want, sum number, and flatten the multi-index:
df.groupby(['sex','year','name'])['number'].sum().reset_index()
In your case the column state is not summable; pass numeric_only=True so non-numeric columns are excluded, and you can shorten to:
df.groupby(['sex','year','name']).sum(numeric_only=True).reset_index()
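Running the groupby version on the sample frame above reproduces the desired df1:
print(df.groupby(['sex', 'year', 'name'])['number'].sum().reset_index())
#   sex  year     name  number
# 0   M  2014     Seth      30
# 1   M  2014  Spencer      28
# 2   M  2014     Tyce       5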
