This should be easy, but I can't figure out the right syntax. Let's say I get a numpy array of all the NA locations for a particular column like so:
index = np.where(df['Gene'].isnull())[0]
I now want to examine those rows in the df. I've tried things like:
df.iloc[[index]]
df[[index]]
To no avail.
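For what it's worth, a minimal sketch of what should work here: np.where already returns a flat array of positional indices, so the extra pair of brackets is what trips iloc up.

df.iloc[index]

Alternatively, the intermediate array can be skipped entirely with boolean indexing:

df[df['Gene'].isnull()]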
I have a dataframe like below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [12124, 12124, 5687, 5687, 7892],
                   'A': [np.nan, np.nan, 3.05, 3.05, np.nan],
                   'B': [1.05, 1.05, np.nan, np.nan, np.nan],
                   'C': [np.nan, np.nan, np.nan, np.nan, np.nan],
                   'D': [np.nan, np.nan, np.nan, np.nan, 7.09]})
I want to get a box plot of columns A, B, C, and D, where the redundant row values in each column need to be counted only once. How do I accomplish that?
pandas can only deal with rectangular data: every column in a DataFrame has the same length. If duplicate values need to be counted only once, each column ends up with a different number of values, which no longer fits the DataFrame shape. Here is my suggestion: transform the DataFrame into lists.
The code for transforming the DataFrame into lists is sketched below. You can then plot the box plot from the list data.
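A minimal sketch of that transformation, assuming duplicates within a column should collapse to a single value and NaNs should be dropped (column names taken from the example above):

import matplotlib.pyplot as plt

# per column: drop NaNs, then keep each repeated value only once
data = [df[col].dropna().unique().tolist() for col in ['A', 'B', 'C', 'D']]

plt.boxplot(data, labels=['A', 'B', 'C', 'D'])
plt.show()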
I have a Pandas DataFrame to which I would like to add a new column that I will then populate with numpy arrays, such that each row in that column contains a numpy array. I'm using the following approach, and am wondering whether it is the right way to do it.
df['embeddings'] = pd.Series(dtype='object')
Then I would iterate over rows and add computed arrays like so (using np.zeros(1024) for illustration only, in reality these are the output of a neural network):
for i in range(df.shape[0]):
    df['embeddings'].loc[i] = np.zeros(1024)
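As an aside (a variant, not part of the original question): df['embeddings'].loc[i] is chained indexing, which pandas may flag with SettingWithCopyWarning. Assuming the default RangeIndex, a per-cell setter avoids that:

for i in range(df.shape[0]):
    df.at[i, 'embeddings'] = np.zeros(1024)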
I tested whether it helps to pre-allocate the cells like so, but didn't notice a difference in execution time when I then iterate over rows, at least not with a DataFrame that only has 200 rows:
df['embeddings'] = [np.zeros(1024)] * df.shape[0]
As an alternative to adding a column and then updating its rows, one could build the list of numpy arrays first and then add the whole list as a new column, but that would require more memory.
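A sketch of that alternative, again with np.zeros(1024) standing in for the network output. Note that the pre-allocation idiom [np.zeros(1024)] * df.shape[0] above puts the same array object in every cell, so a comprehension is the safer spelling when the cells are meant to be independent:

# one independent array per row, attached in a single assignment
embeddings = [np.zeros(1024) for _ in range(len(df))]
df['embeddings'] = embeddings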
I have a Pandas DataFrame (dataset, 889x4) and a Numpy ndarray (targets_one_hot, 889x29), which I want to concatenate. Therefore, I want to convert targets_one_hot into a Pandas DataFrame.
To do so, I looked at several suggestions. However, those suggestions deal with smaller arrays, for which it is feasible to write out the columns one by one.
For 29 columns, this seems inefficient. Can anyone tell me an efficient way to turn this Numpy array into a Pandas DataFrame?
We can wrap a numpy array in a pandas DataFrame by passing it as the first parameter. Then we can make use of pd.concat(..) [pandas-doc] to concatenate the original dataset and the DataFrame of targets_one_hot into a new DataFrame. Since we are concatenating "horizontally" (adding columns side by side), we need to set the axis parameter to axis=1:
pd.concat((dataset, pd.DataFrame(targets_one_hot)), axis=1)
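Two hedged notes on this sketch: pd.concat aligns on index labels, so if dataset does not use the default RangeIndex it may help to reset it first, and the new block's columns can be given names (the target_ prefix below is just an illustrative choice):

targets_df = pd.DataFrame(targets_one_hot,
                          columns=[f'target_{i}' for i in range(targets_one_hot.shape[1])])
pd.concat((dataset.reset_index(drop=True), targets_df), axis=1)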
Here is what is happening:
df = pd.read_csv('data')
important_region = df[df.columns.get_loc('A'):df.columns.get_loc('C')]
important_region_arr = important_region.values
print(important_region_arr)
Now, here is the issue:
print(important_region.shape)
output: (5,30)
print(important_region_arr.shape)
output: (5,30)
print(important_region)
output: my columns, in the panda way
print(important_region_arr)
output: first 5 rows of the dataframe
How, having indexed my columns, do I transition to the numpy array?
Alternatively, I could just convert to numpy from the get-go and run the slicing operation within numpy. But, how is this done in pandas?
So here is how you can slice the dataset down to specific columns. loc gives you access to a group of rows and columns: the part before the comma selects rows, the part after selects columns, and a bare : means all rows.
df.loc[:, 'A':'C']
For more detail, please look at the documentation.
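To then make the transition to a numpy array that the question asks about, a minimal sketch (to_numpy() is the modern spelling of the .values used above):

important_region_arr = df.loc[:, 'A':'C'].to_numpy()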
I have to write an object that takes either a pandas data frame or a numpy array as the input (similar to sklearn behavior). In one of the methods for this object, I need to select the columns (not a particular fixed one, I get a few column indices based on other calculations).
So, to make my code compatible with both input types, I tried to find a common way to select columns and tried methods like X[:, 0] (which doesn't work on pandas dataframes), X[0], and others, but they select differently. Is there a way to select columns in a similar fashion across pandas and numpy?
If no then how does sklearn work across these data structures?
You can use an if condition within your method and have separate selection paths for pandas dataframes and numpy arrays. Sample code is given below.
def method_1(self, var, col_indices):
    if isinstance(var, pd.DataFrame):
        selected_columns = var[var.columns[col_indices]]
    else:
        selected_columns = var[:, col_indices]
    return selected_columns
Here, var is your input, which can be a numpy array or a pandas dataframe, and col_indices are the indices of the columns you want to select.
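If a single code path is preferred over the branch, one sketch is to normalize the input up front; this drops the DataFrame's column labels, which is roughly what sklearn itself does when it validates inputs with check_array:

# works for both DataFrames and ndarrays, at the cost of losing column labels
selected_columns = np.asarray(var)[:, col_indices]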