I have a dataset that has 2 copies of each record. Each record has an ID, and each copy has the same ID.
15 of 18 fields are identical in both copies of the records. But in 3 fields, the top row contains 2 items and 1 NAN; the bottom row contains 1 item (where top row had a NAN) and 2 NANs (where top row had items). Sometimes there are random NANs that don't follow this pattern.
I need to collapse each record into one so that I have a single record that contains all 3 non-NAN fields.
I have tried various versions of groupby, but they omit the 3 fields I need, which are all string-based, and they double the values of certain numeric fields.
If all else fails, I'll turn the letter fields into number codes and df.groupby(['ID']).agg('sum')
But I figure there's probably a smarter way to do this.
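One possible approach, sketched here under assumptions: `GroupBy.first()` returns the first non-null value in each column per group, so it works on string columns and does not sum numeric ones. The column names `amount`, `a`, `b` and `c` and the sample values are made up for illustration; only `ID` comes from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two copies per ID, string fields split across the copies.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "amount": [10, 10, 20, 20],           # identical in both copies
    "a": ["x", np.nan, "p", np.nan],
    "b": ["y", np.nan, np.nan, np.nan],   # ID 2 has a stray NaN in both copies
    "c": [np.nan, "z", np.nan, "r"],
})

# first() skips missing values, so each column keeps its first non-null entry.
collapsed = df.groupby("ID", as_index=False).first()
print(collapsed)
```

If the two copies ever disagree on a non-null value, `first()` silently keeps the one from the top row, so it may be worth checking for such conflicts separately.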
I have a dataframe containing duplicates, that are flagged by a specific variable. The df looks like this:
[image of the dataframe]
The idea is that each row to keep and its duplicates are stacked in batches (a pair, or more if there are many duplicates) and identified by the "duplicate" column. For each batch, I would like to keep a row based on one condition: it has to be the row with the smallest number of empty cells. For Alice, for instance, it should be the second row (and not the one flagged "keep").
The difficulty also lies in the fact that I cannot group by the "name", "lastname" or "phone" column, because they are not always filled (the duplicates are computed on these 3 concatenated columns by an ML algorithm).
Unlike the already posted questions I've seen (how do I remove rows with duplicate values of columns in pandas data frame?), here the condition for selecting the row to keep is not fixed (like keeping the first or last row within the batch of duplicates) but depends on how complete the rows are in each batch.
How can I parse the dataframe according to this "duplicate" column and, within each batch, extract the row I want?
I tried to assign a unique label to each batch in order to iterate over these labels, but it fails.
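A minimal sketch of the batch-labelling idea, with strong assumptions because the original screenshot is not available: it assumes every batch starts with the row flagged "keep" and that the other rows carry some other flag (the value "duplicate" below is hypothetical, as are the sample data). A cumulative sum over the "keep" flags then gives each batch its own label.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each batch starts with a "keep" row followed by its duplicates.
df = pd.DataFrame({
    "name": ["Alice", "Alice", "Bob", np.nan, "Bob"],
    "lastname": [np.nan, "Smith", "Jones", "Jones", np.nan],
    "phone": [np.nan, "555-0100", "555-0101", np.nan, np.nan],
    "duplicate": ["keep", "duplicate", "keep", "duplicate", "duplicate"],
})

# Every "keep" row opens a new batch, so the running count is a batch label.
batch = df["duplicate"].eq("keep").cumsum()

# Number of filled cells per row, ignoring the flag column itself.
filled = df.drop(columns="duplicate").notna().sum(axis=1)

# For each batch, take the index label of the most complete row.
best = filled.groupby(batch).idxmax()
result = df.loc[best]
print(result)
```

On ties, `idxmax` returns the first row of the batch with the maximum count, which under the assumed layout favours the row flagged "keep".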
I have the following dataframe, which contains 2 rows:
index  name     food   color  number  year  hobby  music
0      Lorenzo  pasta  blue   5       1995  art    jazz
1      Lorenzo  pasta  blue   3       1995  art    jazz
I want to write code that can tell me which column distinguishes between these two rows.
For example, in this dataframe, the column "number" is the one that distinguishes between the two rows.
Until now I have done this very simply, by going over column after column using iloc and looking at the values.
duplicates.iloc[:,3]
>>>
0 blue
1 blue
It's important to take into account that:
This should run inside a for loop; each time, I check a newly generated dataframe.
There may be more than 2 rows which I need to check.
There may be more than 1 column that can distinguish between the rows.
I thought that the way to check such a thing would be to take one column at a time, get the unique values and check whether they are equal to each other, similar to this:
for n in np.arange(0, len(df.columns)):
    tmp = df.iloc[:, n]
and then I thought to compare whether all the values are similar to each other in the temporary dataframe, but here I got stuck because sometimes I have many rows and also I need.
My end goal: to be able to check, inside a for loop, which column has different values in each row of the temporary dataframe and hence can help to distinguish between the rows.
You can apply the duplicated method on all columns:
s = df.apply(pd.Series.duplicated).any()
s[~s].index
Output: ['number']
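Note that the `duplicated` check only reports columns in which every value is unique. With more than 2 rows, a column may still vary without being fully unique. A sketch of an alternative under that assumption, using `nunique` to find columns with more than one distinct value (the third row below is made up for illustration):

```python
import pandas as pd

# Hypothetical three-row batch: "number" varies, but the value 5 appears twice.
df = pd.DataFrame({
    "name": ["Lorenzo", "Lorenzo", "Lorenzo"],
    "food": ["pasta", "pasta", "pasta"],
    "number": [5, 3, 5],
    "year": [1995, 1995, 1995],
})

# Columns in which no value repeats at all (the approach from the answer above).
s = df.apply(pd.Series.duplicated).any()
all_unique = s[~s].index.tolist()

# Columns with at least two distinct values (NaN counted as a value).
n = df.nunique(dropna=False)
varying = n[n > 1].index.tolist()

print(all_unique)  # []
print(varying)     # ['number']
```

Which of the two is appropriate depends on whether a column must separate every row from every other row, or only needs to differ somewhere within the batch.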
I have a pandas DataFrame, let's say its named "df", with numerical values inside it in all columns (floats). I want to retrieve the top 5 highest absolute values from the dataframe, together with their row and column labels.
I've seen suggestions like:
df.abs().stack().nlargest(5)
but the stack method doesn't seem to keep the row and column labels for all elements: it enumerates one of the axes and, for each element, then enumerates the other axis, with a blank element before. I need the value and the names of BOTH the column and the row.
I know I can do this by iterating over each column, then each row inside it, then accessing the value and appending to 3 lists (one with row names, one with column names and a third with the values), then copying the values list to get a fourth list with the absolute values, using this last list to get the positions of the 5 highest values, and using those positions to index the first 3 lists, thereby getting the row name, column name and value. There must be a better, more compact and more pythonic way, but I seriously cannot find it anywhere, and I am usually good at googling my issues away.
The suggested solution does keep the row and column labels: they are in the index of the result (a MultiIndex, which is displayed with repeated outer labels left blank) and are not lost.
A simple example where the appropriate names are reattached:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': np.random.random(100), 'b': np.random.random(100)})
df.abs().stack().nlargest(5).rename('value').rename_axis(['row', 'column']).reset_index()
Result:
   row column     value
0   87      a  0.958382
1   49      a  0.953590
2   55      a  0.952150
3   31      b  0.949763
4    4      b  0.931452
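One caveat: the expression above returns the absolute values, so the sign of negative entries is lost. If the original signed values are wanted, a possible variant (a sketch with made-up data and row labels) ranks by absolute value and then looks the winners up in the unmodified stacked Series:

```python
import pandas as pd

# Hypothetical small frame with labelled rows and some negative values.
df = pd.DataFrame(
    {"a": [0.1, -0.9, 0.3], "b": [-0.7, 0.2, 0.8]},
    index=["r0", "r1", "r2"],
)

stacked = df.stack()                        # Series indexed by (row, column)
order = stacked.abs().nlargest(3).index     # labels of the 3 largest magnitudes
top = stacked.reindex(order)                # original signed values, same order

result = top.rename("value").rename_axis(["row", "column"]).reset_index()
print(result)
```

Here the three rows come out as (r1, a, -0.9), (r2, b, 0.8) and (r0, b, -0.7).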
I have an array here that describes some data in a pandas Panel. I would like to drop the NaNs (which are rows along the major axis) and leave the data intact, but it seems that calling .dropna(axis=1, how='any') will discard one row from the item that has 10 good rows, and calling .dropna(axis=1, how='all') will leave one row of NaNs on the item that has 9 good rows. How can I dispose of the NaNs without losing data?
You still need to have the same dimensions in the two items of your panel. Because the second item has 4 NaN rows and the first has 3, you will always have to either keep one NaN row in the second item or throw away one non-NaN row in the first item. If you don't want that, then you have to work with two separate dataframes so they can end up with a different number of rows.
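A minimal sketch of the separate-dataframes suggestion, assuming the items can be held in a plain dict (the item names and values are made up). Panel has since been removed from pandas, so a dict of DataFrames is also one way to keep this working on current versions.

```python
import numpy as np
import pandas as pd

# Hypothetical items with a different number of all-NaN rows each.
items = {
    "first": pd.DataFrame({"x": [1.0, 2.0, np.nan], "y": [4.0, 5.0, np.nan]}),
    "second": pd.DataFrame({"x": [1.0, np.nan, np.nan], "y": [4.0, np.nan, np.nan]}),
}

# Drop all-NaN rows per item; each frame may end up with its own length.
cleaned = {name: frame.dropna(how="all") for name, frame in items.items()}

for name, frame in cleaned.items():
    print(name, len(frame))
```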
My program randomizes the rows in my csv file. There are 120 rows with 5 columns. The first two columns each contain the name of an image. There are about 30 (out of 2 columns*120 rows=240) image names that repeat, once each. The repeat (second copy) of an image name isn't necessarily in the same column as the first one (it may be in column 1 OR 2), but it might be.
I need to have the program check whether each of the 2 images in each row (columns 1 and 2) is also in one of the following 8 rows. If it is, I need it to move the row with the second instance of the image to the end of the file. Then I need it to check again (because some images repeat twice), and if it finds ANOTHER repeat, I need it to put that second repeat row somewhere else in the file. Don't care where, just as long as it's not within 7 rows of the target row (the one being tested).
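A sketch of one way to get an ordering with that property. Note that it swaps in a different technique from the row-moving procedure described above: it simply reshuffles the whole list until no image repeats within the window. The function names, the `window` and `max_tries` parameters and the sample rows are all hypothetical; rows are assumed to be lists whose first two entries are the image names.

```python
import random

def first_conflict(rows, window=8):
    """Return the index of the first row that shares an image (columns 1-2)
    with one of the `window` rows before it, or None if there is none."""
    for j, row in enumerate(rows):
        images = set(row[:2])
        for i in range(max(0, j - window), j):
            if images & set(rows[i][:2]):
                return j
    return None

def shuffle_without_close_repeats(rows, window=8, max_tries=10_000, seed=None):
    """Reshuffle until no image repeats within `window` rows (rejection sampling)."""
    rng = random.Random(seed)
    rows = list(rows)  # work on a copy
    for _ in range(max_tries):
        rng.shuffle(rows)
        if first_conflict(rows, window) is None:
            return rows
    raise RuntimeError("no valid ordering found within max_tries")

# Hypothetical data: 40 rows, 5 images that each appear in two different rows.
rows = [[f"img{i}a", f"img{i}b", i] for i in range(40)]
for k in range(5):
    rows[k + 20][1] = rows[k][0]

ordered = shuffle_without_close_repeats(rows, window=8, seed=0)
print(first_conflict(ordered, window=8))  # None
```

Reshuffling is simple but can need many attempts when there are many repeats relative to the number of rows; `max_tries` only bounds the search and does not guarantee that an ordering is found.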