I have two pandas DataFrames: one where each column is a cumulative distribution (all entries in [0,1] and monotonically increasing) and a second with the values associated with each cumulative distribution.
I need to access the values associated with different points in the cumulative distributions (percentiles). For example, I could be interested in the percentiles [.1,.9]. I find the location of these percentiles in the DataFrame of associated values by checking where in the first DataFrame the percentiles would have to be inserted. This gives me a 2-d numpy array where each column holds the row locations for that column.
How can I use this array to access the values in the DataFrame? Is there a better way to access the values in one of the DataFrames based on where the percentile is located in the first DataFrame?
import pandas as pd
import numpy as np
cdfs = pd.DataFrame([[.1,.2],[.4,.3],[.8,.7],[1.0,1.0]])
df1 = pd.DataFrame([[-10.0,-8.0],[1.4,3.3],[5.8,8.7],[11.0,15.0]])
percentiles = [0.15,0.75]
spots = np.apply_along_axis(np.searchsorted,0,cdfs,percentiles)
This does not work:
df1[spots]
Expected output:
[[1.4 -8.0]
[5.8 15.0]]
This does work, but seems cumbersome:
output = pd.DataFrame(index=percentiles,columns=df1.columns)
for column in range(spots.shape[1]):
    output.loc[percentiles,column] = df1.loc[spots[:,column],column].values
Try this:
df1.values[spots, [0, 1]]
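The one-liner above relies on NumPy integer-array indexing: the column list [0, 1] is broadcast against the 2-d spots array, so each entry of spots is paired with its own column. A minimal sketch of the same idea for any number of columns, assuming every percentile is at most the last CDF value so that searchsorted never returns an out-of-range row (the names cols and values are only illustrative):

```python
import numpy as np
import pandas as pd

cdfs = pd.DataFrame([[.1, .2], [.4, .3], [.8, .7], [1.0, 1.0]])
df1 = pd.DataFrame([[-10.0, -8.0], [1.4, 3.3], [5.8, 8.7], [11.0, 15.0]])
percentiles = [0.15, 0.75]

# Row position of each percentile, computed column by column.
spots = np.apply_along_axis(np.searchsorted, 0, cdfs.to_numpy(), percentiles)

# One column index per column of df1; it broadcasts against spots,
# so values[i, j] == df1.iloc[spots[i, j], j].
cols = np.arange(df1.shape[1])
values = df1.to_numpy()[spots, cols]

# Wrap the result back into a DataFrame indexed by percentile.
output = pd.DataFrame(values, index=percentiles, columns=df1.columns)
print(output)
```

df1[spots] fails because indexing a DataFrame with [] is treated as a column selection, not as positional row/column indexing.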
I have a dataframe like below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [12124,12124,5687,5687,7892],
'A': [np.nan,np.nan,3.05,3.05,np.nan],'B':[1.05,1.05,np.nan,np.nan,np.nan],'C':[np.nan,np.nan,np.nan,np.nan,np.nan],'D':[np.nan,np.nan,np.nan,np.nan,7.09]})
I want to get a box plot of columns A, B, C, and D, where the redundant (repeated) row values in each column are counted only once. How do I accomplish that?
pandas requires every column of a DataFrame to have the same length, so only rectangular (frame-shaped) data can be processed. Counting repeated values only once would leave columns of different lengths, which conflicts with that model. Here is my suggestion: transform the DataFrame into lists.
The detailed code for transforming the DataFrame into a list
Then you could try to plot the box plot from the list data and index column.
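The conversion step is not shown in the answer, so the following is only a sketch of one way it might look. It assumes that "redundant" rows are those that repeat the same id, and the names unique_rows and data are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [12124, 12124, 5687, 5687, 7892],
                   'A': [np.nan, np.nan, 3.05, 3.05, np.nan],
                   'B': [1.05, 1.05, np.nan, np.nan, np.nan],
                   'C': [np.nan, np.nan, np.nan, np.nan, np.nan],
                   'D': [np.nan, np.nan, np.nan, np.nan, 7.09]})

# Assumption: a row is redundant if its id has already been seen.
unique_rows = df.drop_duplicates(subset='id')

# One list of non-missing values per column; the lists may differ in length.
columns = ['A', 'B', 'C', 'D']
data = [unique_rows[c].dropna().tolist() for c in columns]
print(data)

# The lists can then be handed to a plotting call, for example
# matplotlib's plt.boxplot(data), with the column names set as tick labels.
```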
I am trying to analyse wind speed data from a lidar by creating a DataFrame in which the columns are the investigated heights and the single row holds the number of NaNs at each height. My script creates the DataFrame and names the columns as required, but it doesn't write the number of NaNs into the corresponding cells. Any idea what the problem might be?
df=pd.read_csv(fileApath,delimiter=',',skiprows=1)
heights = ['123','98','68','65','57','48','39','38','29','18','10']
nanvalues_speed=pd.DataFrame()
for i in heights:
    nanvalues_speed[i+'m']=pd.notnull(df['Horizontal Wind Speed (m/s) at '+i+'m']).sum()
The function you are looking for is pandas.DataFrame.isna()
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1,2,3,np.nan ,5],
'b': ['a',np.nan ,'c','d','e']})
df.isna().sum()
The pandas.DataFrame.isna() function returns a boolean object of the same size, indicating which values in the DataFrame are NA.
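Applied to the lidar question, this might look like the sketch below. The small DataFrame is a hypothetical stand-in for the CSV file, with only two heights. Two things differ from the original loop: isna() counts the missing values (pd.notnull counts the non-missing ones), and the counts are collected in a dict first, because assigning a scalar to a column of an empty DataFrame creates the column but no row to hold the value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data read from the lidar CSV.
df = pd.DataFrame({
    'Horizontal Wind Speed (m/s) at 123m': [5.1, np.nan, 6.2],
    'Horizontal Wind Speed (m/s) at 98m': [np.nan, np.nan, 4.0],
})
heights = ['123', '98']

# Count the NaNs per height, then build a one-row DataFrame from the counts.
counts = {
    h + 'm': df['Horizontal Wind Speed (m/s) at ' + h + 'm'].isna().sum()
    for h in heights
}
nanvalues_speed = pd.DataFrame([counts])
print(nanvalues_speed)
```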
I have a pandas DataFrame - assume, for example, that it contains some rolling covariance estimates:
df = #some data
rolling_covariance = df.rolling(window=100).cov()
If the data has n columns then rolling_covariance will contain m n by n matrices, where m is the number of rows in df.
Is there a quick one-liner to transform rolling_covariance into a numpy array of matrices? For example, you can access individual rows in rolling_covariance using iloc, and you can also iterate through the first level of the MultiIndex and extract the data that way, but I'd like something fast and simple if available.
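One possible approach, sketched under the assumption that rolling_covariance keeps the usual layout (a MultiIndex of original row label and column name, with n rows per original row), is to reshape the underlying values. The random data and the window of 4 are placeholders only:

```python
import numpy as np
import pandas as pd

# Placeholder data: 6 rows, 3 columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(6, 3)), columns=['x', 'y', 'z'])

rolling_covariance = df.rolling(window=4).cov()

# The result has m * n rows and n columns, grouped by original row,
# so it can be reshaped into m matrices of shape (n, n).
m, n = df.shape
matrices = rolling_covariance.to_numpy().reshape(m, n, n)
print(matrices.shape)
```

The matrices for the first window - 1 rows contain only NaN, since a full window is not yet available there.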
I'm trying to remove duplicate elements in a dataframe. This DataFrame comes from calculating the distance between a given list of geocoordinates. As you can see in the following DataFrame, the data is duplicated but I can't set the index to 'dist' because in other cases, the distance might be 0 or 1 (repeated) and then important data will be discarded.
import pandas as pd
df = pd.DataFrame({'Name_x':['a','b','c','d'],
'Name_y':['b','a','d','c'],
'Latitude_x':['lat_a','lat_b','lat_c','lat_d'],
'Longitude_x':['long_a','long_b','long_c','long_d'],
'Latitude_y':['lat_b','lat_a','lat_d','lat_c'],
'Longitude_y':['long_b','long_a','long_d','long_c'],
'dist':[0,0,1,1]})
df
In this case I would like to keep the values Name_x: ['a','c'], Name_y: ['b','d'] with the corresponding geocoordinates: Latitude_x: ['lat_a','lat_c'], Latitude_y: ['lat_b','lat_d'], Longitude_x: ['long_a','long_c'], Longitude_y: ['long_b','long_d'].
I'm not sure if you want this:
df['Name_x'].eq(df['Name_y'].shift()) # filter by equals for name
df.loc[df['Name_x'].eq(df['Name_y'].shift())] # Your "unique" rows
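The shift-based filter depends on mirrored rows being adjacent. As an alternative sketch (a different technique from the one above), the two names can be sorted within each row so that (a, b) and (b, a) produce the same key, and rows with a repeated key dropped. The name pair_key is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name_x': ['a', 'b', 'c', 'd'],
                   'Name_y': ['b', 'a', 'd', 'c'],
                   'Latitude_x': ['lat_a', 'lat_b', 'lat_c', 'lat_d'],
                   'Longitude_x': ['long_a', 'long_b', 'long_c', 'long_d'],
                   'Latitude_y': ['lat_b', 'lat_a', 'lat_d', 'lat_c'],
                   'Longitude_y': ['long_b', 'long_a', 'long_d', 'long_c'],
                   'dist': [0, 0, 1, 1]})

# Sort the two names within each row so the pair is order-independent.
pair_key = pd.DataFrame(
    np.sort(df[['Name_x', 'Name_y']].to_numpy(), axis=1),
    index=df.index,
)

# Keep only the first row for every unordered pair of names.
result = df[~pair_key.duplicated()]
print(result)
```

This keeps the first occurrence of each pair, so the rows with Name_x 'a' and 'c' remain, regardless of the value in dist.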
I have a pandas Series that contains key-value pairs, where the key is the name of a column in my pandas DataFrame and the value is an index in that column of the DataFrame.
For example:
Series:
[image: Series]
Then in my DataFrame:
[image: DataFrame]
So, from my DataFrame I want to extract the value at index 12 in column 'A', which is 435.81. I want to put all these values into another Series, something like { 'A': 435.81, 'AAP': 468.97, ...}
My reputation is low so I can't post my images as images instead of links (can someone help fix this? thanks!)
I think this indexing is what you're looking for.
pd.Series(np.diag(df.loc[ser,ser.axes[0]]), index=df.columns)
df.loc allows you to index based on string indices. You get your rows given from the values in ser (first positional argument in df.loc) and you get your column location from the labels of ser (I don't know if there is a better way to get the labels from a series than ser.axes[0]). The values you want are along the main diagonal of the result, so you take just the diagonal and associate them with the column labels.
The indexing I gave before only works if your DataFrame uses integer row indices, or if the data type of your Series values matches the DataFrame row indices. If you have a DataFrame with non-integer row indices, but still want to get values based on integer rows, then use the following (however, all indices from your series must be within the range of the DataFrame, which is not the case with 'AAL' being 1758 and only 12 rows, for example):
pd.Series(np.diag(df.iloc[ser,:].loc[:,ser.axes[0]]), index=df.columns)
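The diagonal trick builds a full len(ser) by len(ser) block only to read its diagonal. A sketch of a more direct lookup, using made-up data rather than the data from the question, pairs each label in ser with its row through df.at and labels the result with ser.index (which is also a shorter way to get the labels than ser.axes[0]):

```python
import pandas as pd

# Made-up data: row labels 10..12, two columns.
df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'AAP': [4.0, 5.0, 6.0]},
                  index=[10, 11, 12])

# Column name -> row label to read from that column.
ser = pd.Series({'A': 12, 'AAP': 10})

# Look up one cell per (row label, column name) pair.
result = pd.Series(
    [df.at[row, col] for col, row in ser.items()],
    index=ser.index,
)
print(result)
```

For position-based rows instead of labels, df.iat with the column position from df.columns.get_loc(col) could be used in the same way.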