I am trying to analyse wind speed data from a lidar, creating a dataframe in which the columns are the investigated heights and the single row holds the number of NaNs at each elevation. My script creates the dataframe and names the columns as required, but it doesn't write the number of NaNs into the corresponding cells. Any idea what the problem might be?
df = pd.read_csv(fileApath, delimiter=',', skiprows=1)
heights = ['123','98','68','65','57','48','39','38','29','18','10']
nanvalues_speed = pd.DataFrame()
for i in heights:
    nanvalues_speed[i+'m'] = pd.notnull(df['Horizontal Wind Speed (m/s) at '+i+'m']).sum()
The function you are looking for is pandas.DataFrame.isna()
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2, 3, np.nan, 5],
                   'b': ['a', np.nan, 'c', 'd', 'e']})
df.isna().sum()  # -> a: 1, b: 1 (number of NaNs per column)
The pandas.DataFrame.isna() function returns a boolean same-sized object indicating whether the values in the DataFrame are NA.
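Applied to your loop, here is a minimal sketch (keeping your fileApath variable and column naming). Two things to note: notnull() counts the non-NaN values, so you want isna(), and assigning a scalar to a column of an empty DataFrame creates the column but no rows, which is why your cells stay empty:
import pandas as pd

df = pd.read_csv(fileApath, delimiter=',', skiprows=1)
heights = ['123','98','68','65','57','48','39','38','29','18','10']

# Collect the counts in a dict first, then build a one-row DataFrame;
# isna() marks the NaNs, notnull() would count the valid values instead.
counts = {h + 'm': df['Horizontal Wind Speed (m/s) at ' + h + 'm'].isna().sum()
          for h in heights}
nanvalues_speed = pd.DataFrame([counts])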
I have an Excel file with several columns.
From these columns I want to plot the ones named like this:
IVOF_1_H, IVOF_1_L, IVOF_2_H, IVOF_2_L, ... Those columns go on the y axis; the column for the x axis is always the same.
I do not know how many of those columns the file contains; I only know that the number keeps increasing. Is there a way to check how many IVOF columns I have and plot them all?
In general there is an upper limit on the number of IVOF columns, and I don't mind setting up my script so that all of them get plotted (if they exist), but I don't know how to keep the code from crashing when one of those columns is missing.
You can filter your DataFrame by column name:
import pandas as pd
df = pd.read_excel('sample.xlsx')
df = df.filter(regex=("IVOF.*"))
#plot the first row
df.iloc[0].plot(kind="bar")
#plot all rows
df.plot(kind="bar")
A simple example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[2,4,4],[4,3,3],[5,9,1]]),columns=['A','B1','B2'])
df = df.filter(regex=("B.*"))
df.plot(kind="bar")
The result: a bar chart containing only the B1 and B2 columns.
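To also answer the counting part: filter() simply returns whatever columns match, so nothing crashes when some are missing. A hedged sketch (the exact IVOF pattern and the x-axis column name 'x_col' are assumptions):
import pandas as pd

df = pd.read_excel('sample.xlsx')
# filter() returns only the matching columns; a missing column just
# means one fewer match, never an error.
ivof_cols = list(df.filter(regex=r"IVOF_\d+_[HL]").columns)
print(len(ivof_cols), 'IVOF columns found')
if ivof_cols:
    df.plot(x='x_col', y=ivof_cols)  # 'x_col' is a hypothetical x-axis column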
I have a dataframe like below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [12124, 12124, 5687, 5687, 7892],
                   'A': [np.nan, np.nan, 3.05, 3.05, np.nan],
                   'B': [1.05, 1.05, np.nan, np.nan, np.nan],
                   'C': [np.nan, np.nan, np.nan, np.nan, np.nan],
                   'D': [np.nan, np.nan, np.nan, np.nan, 7.09]})
I want to get a box plot of columns A, B, C, and D, where the redundant row values in each column are counted only once. How do I accomplish that?
Pandas can only deal with frame-shaped data: every column must have the same length, and likewise every row. If duplicate values are to be counted only once, each column ends up with a different number of entries, which conflicts with that rectangular shape. My suggestion: transform each column of the DataFrame into a list, dropping the NaNs and the duplicates.
You can then plot the box plot from the list data, as in the sketch below.
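A minimal sketch of that idea (interpreting "redundant row values counted once" as dropping duplicate values within each column, which is an assumption):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame({'id': [12124, 12124, 5687, 5687, 7892],
                   'A': [np.nan, np.nan, 3.05, 3.05, np.nan],
                   'B': [1.05, 1.05, np.nan, np.nan, np.nan],
                   'C': [np.nan, np.nan, np.nan, np.nan, np.nan],
                   'D': [np.nan, np.nan, np.nan, np.nan, 7.09]})

# One de-duplicated, NaN-free array per column; the arrays have different
# lengths, which a DataFrame cannot hold but plt.boxplot accepts.
data = {c: df[c].dropna().unique() for c in ['A', 'B', 'C', 'D']}
plt.boxplot(list(data.values()), labels=list(data.keys()))  # C is empty and draws nothing
plt.show()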
I have a dataframe
import yfinance as yf
import numpy as np
from scipy.signal import argrelextrema
import matplotlib.pyplot as plt
import pandas as pd
company_name = "INFY.NS"
df = yf.Ticker(company_name).history(period='400d', interval='1D')
Now I have the dataframe df, and I do calculations to get the local max and min values.
n = 2
df['min'] = df.iloc[argrelextrema(df['Close'].values, np.less_equal,order=n)[0]]['Close']
df['max'] = df.iloc[argrelextrema(df['Close'].values, np.greater_equal,order=n)[0]]['Close']
print(df)
The dataframe then looks like this (min and max hold values only at the local extrema, NaN everywhere else).
But instead of these 2 columns, i.e. max and min, I want only one column named MACD holding the values of both.
Thus,
if max is NaN and min has a value, add it to the MACD column, and vice versa;
if max and min are both NaN, drop the row.
What is the best way to do this?
I have found the answer: merge the columns and remove the NaN rows. Posting the code here.
# Take 'min' where it has a value, otherwise fall back to 'max'
df['total'] = df['min'].combine_first(df['max'])
# Drop the rows where both columns were NaN
df = df.dropna(subset=['total'])
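An equivalent spelling with np.where, shown only as an alternative sketch:
import numpy as np

# min where present, else max; rows where both are NaN stay NaN and are dropped
df['total'] = np.where(df['min'].notna(), df['min'], df['max'])
df = df.dropna(subset=['total'])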
I have a pandas dataframe like so:
import pandas as pd
import numpy as np
df = pd.DataFrame([['WY','M',2014,'Seth',5],
                   ['WY','M',2014,'Spencer',5],
                   ['WY','M',2014,'Tyce',5],
                   ['NY','M',2014,'Seth',25],
                   ['MA','M',2014,'Spencer',23]],
                  columns=['state','sex','year','name','number'])
print(df)
How do I manipulate the data to get a dataframe like:
df1 = pd.DataFrame([['M',2014,'Seth',30],
                    ['M',2014,'Spencer',28],
                    ['M',2014,'Tyce',5]],
                   columns=['sex','year','name','number'])
print(df1)
This is just part of a very large dataframe, how would I do this for every name for every year?
df[['sex','year','name','number']].groupby(['sex','year','name']).sum().reset_index()
For a brief description of what this does, from left to right:
Select only the columns we care about. We could replace this part with df.drop('state',axis=1)
Perform a groupby on the columns we care about.
Sum the remaining columns (in this case, just number).
Reset the index so that the columns ['sex','year','name'] are no longer a part of the index.
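On the sample data this reproduces the desired df1:
result = df[['sex','year','name','number']].groupby(['sex','year','name']).sum().reset_index()
print(result)
#   sex  year     name  number
# 0   M  2014     Seth      30
# 1   M  2014  Spencer      28
# 2   M  2014     Tyce       5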
You can use pivot_table. Put the grouping columns in index= so that reset_index() yields the desired flat frame:
df.pivot_table(values='number', index=['sex','year','name'], aggfunc='sum').reset_index()
Group by the columns you want, sum number, and flatten the multi-index:
df.groupby(['sex','year','name'])['number'].sum().reset_index()
In your case the column state is not meaningfully summable (recent pandas would concatenate the strings rather than drop them), so pass numeric_only=True if you want the shorter form:
df.groupby(['sex','year','name']).sum(numeric_only=True).reset_index()
I have two Pandas DataFrames: in the first, each column is a cumulative distribution (all entries between [0,1] and monotonically increasing); the second holds the values associated with each cumulative distribution.
I need to access the values associated with different points of the cumulative distributions (percentiles); for example, I could be interested in the percentiles [.1, .9]. I find the location of each percentile by checking where in the first DataFrame it would be inserted. This gives me a 2-D numpy array in which each column holds the row locations for that column.
How can I use this array to access the values in the DataFrame? Is there a better way to access the values in one of the DataFrames based on where the percentile is located in the first DataFrame?
import pandas as pd
import numpy as np
cdfs = pd.DataFrame([[.1,.2],[.4,.3],[.8,.7],[1.0,1.0]])
df1 = pd.DataFrame([[-10.0,-8.0],[1.4,3.3],[5.8,8.7],[11.0,15.0]])
percentiles = [0.15,0.75]
spots = np.apply_along_axis(np.searchsorted,0,cdfs,percentiles)
This does not work:
df1[spots]
Expected output:
[[1.4 -8.0]
[5.8 15.0]]
This does work, but seems cumbersome:
output = pd.DataFrame(index=percentiles, columns=df1.columns)
for column in range(spots.shape[1]):
    output.loc[percentiles, column] = df1.loc[spots[:, column], column].values
Try NumPy fancy indexing, pairing each row index in spots with its column index:
df1.values[spots, [0, 1]]
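Element (i, j) of the result is df1.values[spots[i, j], j]. To avoid hard-coding [0, 1] when there are more columns, the column indices can be generated; this is a small generalization, not part of the original answer:
import numpy as np

cols = np.arange(spots.shape[1])   # one column index per CDF column
result = df1.values[spots, cols]   # pairs each row index with its column
# array([[ 1.4, -8. ],
#        [ 5.8, 15. ]])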