I have an array here that describes some data in a pandas Panel. I would like to drop the NaNs (which are rows along the major axis) and leave the data intact, but it seems that calling .dropna(axis=1, how='any') will discard one row from the item that has 10 good rows, and calling .dropna(axis=1, how='all') will leave one row of NaNs on the item that has 9 good rows. How can I dispose of the NaNs without losing data?
You still need to have the same dimensions in the two items of your panel. Because the second item has 4 NaN rows and the first has 3, you will always have to either keep one NaN row in the second item or throw away one non-NaN row in the first item. If you don't want that, then you have to work with two separate DataFrames so they can end up with a different number of rows.
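A minimal sketch of that separate-DataFrames route (the column name and values below are made up for illustration): each frame drops its own NaN rows and can keep a different length, which a single Panel cannot represent.

import numpy as np
import pandas as pd

# Two items kept as separate DataFrames (made-up column name and values).
item1 = pd.DataFrame({'x': [1.0, np.nan, 3.0, np.nan, 5.0]})
item2 = pd.DataFrame({'x': [np.nan, 2.0, np.nan, np.nan, 5.0]})

# Each frame drops its own NaN rows and keeps its own length.
clean = {name: frame.dropna(how='any') for name, frame in [('item1', item1), ('item2', item2)]}
print(clean['item1'].shape, clean['item2'].shape)  # (3, 1) (2, 1)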
I have a dataframe containing duplicates, that are flagged by a specific variable. The df looks like this:
[screenshot of the dataframe]
The idea is that the row to keep and its duplicates are stacked in batches (a pair, or more if there are many duplicates) and identified by the "duplicate" column. I would like, for each batch, to keep one row depending on a single condition: it has to be the row with the smallest number of empty cells. For Alice, for instance, it should be the second row (and not the one flagged "keep").
The difficulty also lies in the fact that I cannot group by the "name", "lastname" or "phone" columns, because they are not always filled (the duplicates are computed on these 3 concatenated columns by a ML algorithm).
Unlike already-posted questions I've seen (how do I remove rows with duplicate values of columns in pandas data frame?), here the condition for selecting the row to keep is not fixed (like keeping the first or last row within the batch of duplicates) but depends on the completeness of the rows in each batch.
How can I parse the dataframe according to this "duplicate" column, and extract the row I want from each batch?
I tried to assign a unique label to each batch, in order to iterate over these labels, but it fails.
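One hedged way to realize that batch-labelling idea, assuming each batch starts with a row flagged "keep" in the "duplicate" column (the sample data below is made up): build a batch id with a cumulative sum, then keep the row with the most filled cells in each batch.

import numpy as np
import pandas as pd

# Made-up data: each batch starts with a row flagged 'keep', followed by its duplicates.
df = pd.DataFrame({
    'name':      ['Alice', 'Alice', 'Bob',   np.nan],
    'lastname':  [np.nan,  'Smith', 'Jones', 'Jones'],
    'phone':     ['0601',  '0601',  np.nan,  '0702'],
    'duplicate': ['keep',  'dup',   'keep',  'dup'],
})

# Label each batch: a new 'keep' row starts a new batch.
batch = (df['duplicate'] == 'keep').cumsum()

# Within each batch, pick the index of the row with the most non-empty cells.
best = df.notna().sum(axis=1).groupby(batch).idxmax()
result = df.loc[best]
print(result)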
I have two Pandas DataFrames with one column in common, namely "Dates". I need to merge these two where "Dates" correspond. With pd.merge() it does what is expected, but it removes the non-matching rows. I want to keep those other values too.
Ex: I have historical data for a stock at 1-minute resolution, and an indicator calculated on 5-minute data, i.e. for every 5 rows of the 1-minute DataFrame I have one new calculated value.
I know that the Series.dt.floor method may prove useful here, but I couldn't figure it out.
I concatenated the respective "Dates" to the calculated indicator Series so that I can merge them where the column matches. I obtained the right result, but with missing values. I need continuity of the 1-minute values, i.e. the same indicator must stay valid for the next 5 entries, and then it is the second indicator value's turn to be merged.
df1.merge(df2, left_on='Dates', right_on='Dates')
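A minimal sketch of one way to get that behaviour (the 'price' and 'indicator' column names and the sample timestamps are assumptions, not from the question): do a left merge so no 1-minute row is lost, then forward-fill the indicator so each 5-minute value stays valid for the following entries.

import pandas as pd

# Made-up 1-minute data and a 5-minute indicator.
df1 = pd.DataFrame({'Dates': pd.date_range('2024-01-02 09:30', periods=10, freq='min'),
                    'price': range(10)})
df2 = pd.DataFrame({'Dates': pd.date_range('2024-01-02 09:30', periods=2, freq='5min'),
                    'indicator': [1.5, 2.5]})

# Keep every 1-minute row; rows with no matching 5-minute timestamp get NaN ...
merged = df1.merge(df2, on='Dates', how='left')

# ... and the forward fill carries each indicator over the next 5 entries.
merged['indicator'] = merged['indicator'].ffill()
print(merged)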
I have a pandas DataFrame, let's say it's named "df", with numerical values (floats) in all columns. I want to retrieve the top 5 highest absolute values from the dataframe, together with their row and column labels.
I've seen suggestions like:
df.abs().stack().nlargest(5)
but the stack method doesn't keep the row and column labels for all elements: it enumerates one of the axes and, for each element, then enumerates the other axis, with a blank element before it. I need the value and the names of BOTH the column and the row.
I know I can do this by iterating over each column, then each row inside it, then accessing the value and appending to 3 lists: one with row names, another with column names, and a third with the values; then copying the values list into a fourth list with the absolute values, using this last list to get the positions of the 5 highest values, and using those positions to index the first 3 lists, thereby getting the row name, column name and value. There must be a better, more compact and more pythonic way, though, but I seriously cannot find it anywhere, and I am usually good at googling my issues away.
The suggested solution keeps the row and column labels in the index, so they are not lost.
A simple example where the appropriate names are reattached:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.random(100), 'b': np.random.random(100)})
df.abs().stack().nlargest(5).rename('value').rename_axis(['row', 'column']).reset_index()
Result:
   row column     value
0   87      a  0.958382
1   49      a  0.953590
2   55      a  0.952150
3   31      b  0.949763
4    4      b  0.931452
I have a dataset that has 2 copies of each record. Each record has an ID, and each copy has the same ID.
15 of 18 fields are identical in both copies of the records. But in 3 fields, the top row contains 2 items and 1 NaN; the bottom row contains 1 item (where the top row had a NaN) and 2 NaNs (where the top row had items). Sometimes there are random NaNs that don't follow this pattern.
I need to collapse each record into one row, so that I have a single record that contains all 3 non-NaN fields.
I have tried various versions of groupby. But that omits the 3 fields I need, which are all string-based. And it doubles the values of certain numeric fields.
If all else fails, I'll turn the letter fields into number codes and use df.groupby(['ID']).agg('sum')
But I figure there's probably a smarter way to do this.
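One hedged sketch of a smarter route (the column names below are made up): groupby(...).first() takes the first non-NaN value of each column within each ID, so the string fields survive and the numeric fields are not doubled the way a sum would double them.

import numpy as np
import pandas as pd

# Made-up data: two copies per ID, with complementary NaNs in the text fields.
df = pd.DataFrame({
    'ID':     [1, 1, 2, 2],
    'city':   ['Boston', np.nan, np.nan, 'Denver'],
    'email':  [np.nan, 'a@x.com', 'b@x.com', np.nan],
    'amount': [10.0, 10.0, 20.0, 20.0],
})

# first() skips NaNs, so each field keeps its single non-NaN value per ID.
collapsed = df.groupby('ID', as_index=False).first()
print(collapsed)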
I have a DataFrame containing many NaN values. I want to delete rows that contain too many NaN values; specifically: 7 or more.
I tried using the dropna function several ways but it seems clear that it greedily deletes columns or rows that contain any NaN values.
This question (Slice Pandas DataFrame by Row) shows me that if I can just compile a list of the rows that have too many NaN values, I can delete them all with a simple
df.drop(rows)
I know I can count non-null values using the count function, which I could then subtract from the total to get the NaN count that way (Is there a direct way to count NaN values in a row?). But even so, I am not sure how to write a loop that goes through a DataFrame row-by-row.
Here's some pseudo-code that I think is on the right track:
# Loop over each row, count its NaNs, and collect the labels of the rows to drop.
rows = []
for idx, row in df.iterrows():
    m = df.shape[1] - row.count()   # NaN count in this row
    if m >= 7:
        rows.append(idx)
df = df.drop(rows)
I am still new to Pandas, so I'm very open to other ways of solving this problem, whether they're simpler or more complex.
Basically the way to do this is to determine the number of columns, set the minimum number of non-NaN values a row must have to be kept, and drop the rows that don't meet that threshold. To drop rows with 7 or more NaNs, a row must have at least len(df.columns) - 6 non-NaN values:
df.dropna(thresh=len(df.columns) - 6)
See the docs
The optional thresh argument of df.dropna lets you give it the minimum number of non-NA values required to keep a row. A row with 7 or more NaNs has at most df.shape[1] - 7 non-NA values, so:
df.dropna(thresh=df.shape[1] - 6)
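A quick self-contained check of that thresh logic, on a toy 10-column frame made up for illustration:

import numpy as np
import pandas as pd

# Row 0 has 7 NaNs (should be dropped), row 1 has 6 NaNs (should be kept).
df = pd.DataFrame([[1, 2, 3] + [np.nan] * 7,
                   [1, 2, 3, 4] + [np.nan] * 6])

print(df.dropna(thresh=df.shape[1] - 6))   # only row 1 remains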