reset_index() to original column indices after pandas groupby()?

I generate a grouped dataframe with df = df.groupby(['X','Y']).max(), which I then want to write to CSV, without indexes. So I need to convert 'X' and 'Y' back to regular columns; I tried reset_index(), but the column order came out wrong.
How can I restore columns 'X' and 'Y' to their exact original column positions?
Is the solution:
df.reset_index(level=0, inplace=True)
and then find a way to change the order of the columns?
(I also found this approach for a MultiIndex.)

This solution keeps the columns as-is and doesn't create an index after grouping, so no reset_index() or column reordering is needed at the end:
df.groupby(['X', 'Y'], as_index=False).max()
(After testing a lot of different methods, the simplest one was the best solution, as always, and the one which eluded me the longest. Thanks to @maxymoo for pointing it out.)
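To illustrate the difference, a minimal sketch with made-up data (column 'Z' is hypothetical):

import pandas as pd

df = pd.DataFrame({'X': [1, 1, 2], 'Y': ['a', 'a', 'b'], 'Z': [3, 5, 4]})

# Default: 'X' and 'Y' become the index, so they are lost when writing with index=False.
df.groupby(['X', 'Y']).max()

# as_index=False: 'X' and 'Y' stay as regular columns, in their original positions.
out = df.groupby(['X', 'Y'], as_index=False).max()
out.to_csv('out.csv', index=False)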

Related

Python dataframe; trouble changing value of column with multiple filters

I have a large dataframe I pulled from an ODBC database. The dataframe has multiple columns; I'm trying to change the values of one column by filtering on two others.
First, I filter my dataframe data_prem with both conditions which gives me the correct rows:
data_prem[(data_prem['PRODUCT_NAME']=='ŽZ08') & (data_prem['BENEFIT'].str.contains('19.08.16'))]
Then I use the replace function on the selection to change 'M' value to 'H' value:
data_prem[(data_prem['PRODUCT_NAME']=='ŽZ08') & (data_prem['BENEFIT'].str.contains('19.08.16'))]['Reinsurer'].replace(to_replace='M',value='H',inplace=True,regex=True)
Python warns me that I'm trying to modify a copy of the dataframe, even though I'm clearly referring to the original dataframe (I posted an image so you can see my results).
(image: dataframe filtering)
I also tried using .loc function in the following manner:
data_prem.loc[((data_prem['PRODUCT_NAME']=='ŽZ08') & (data_prem['BENEFIT'].str.contains('19.08.16'))),'Reinsurer'] = 'H'
which changed all rows that fit the second condition (str.contains...), but it didn't apply the first condition. I got replacements in the 'Reinsurer' column for other 'PRODUCT_NAME' values as well.
I've been scouring the web for an answer to this for some time, and I've seen some mentions of a bug in the pandas library; I'm not sure if this is what they were talking about.
I would value any opinions you might have, and would also be interested in alternative ways of solving this problem. I filled the 'Reinsurer' column using the map function with 'PRODUCT_NAME' as the input (I had a dictionary that connected all 'PRODUCT_NAME' values with 'Reinsurer' values).
Given your Boolean mask, you've demonstrated two ways of applying chained indexing. This is the cause of the warning, and the reason why your logic isn't being applied as you anticipate.
mask = (data_prem['PRODUCT_NAME'] == 'ŽZ08') & data_prem['BENEFIT'].str.contains('19.08.16')
Chained indexing: Example #1
data_prem[mask]['Reinsurer'].replace(to_replace='M', value='H', inplace=True, regex=True)
Chained indexing: Example #2
data_prem[mask].loc[mask, 'Reinsurer'] = 'H'
Avoid chained indexing
You can keep things simple by applying your mask once and using a single loc call:
data_prem.loc[mask, 'Reinsurer'] = 'H'
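A self-contained sketch with toy data (the values below are made up) showing the chained version losing the assignment while the single loc call works:

import pandas as pd

data_prem = pd.DataFrame({
    'PRODUCT_NAME': ['ŽZ08', 'ŽZ08', 'OTHER'],
    'BENEFIT': ['19.08.16', 'none', '19.08.16'],
    'Reinsurer': ['M', 'M', 'M'],
})
mask = (data_prem['PRODUCT_NAME'] == 'ŽZ08') & data_prem['BENEFIT'].str.contains('19.08.16')

# Chained indexing: data_prem[mask] may be a copy, so the write is lost
# and pandas emits SettingWithCopyWarning.
# data_prem[mask]['Reinsurer'] = 'H'

# Single loc call: both conditions are applied and the original frame is modified.
data_prem.loc[mask, 'Reinsurer'] = 'H'
print(data_prem)  # only the first row's 'Reinsurer' becomes 'H'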

How can I specify row order when I use dask.dataframe

I have two dataframes with the same shape.
I converted each to a dask dataframe, specifying the same npartitions=50.
However, each dataframe seems to be split into partitions differently.
Does anyone know how I can specify how a dataframe should be partitioned?
Here is a guess: the index values appear to be sorted in both, but one is sorted numerically and the other lexicographically; i.e., I suspect that your dataframe mrt_dask has an index containing strings, not numbers. If this is so, then calling astype before passing it to dask should solve your issue, or perhaps you should change how it is being loaded in the first place.
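A minimal sketch of the suspected fix (the real loading code isn't shown, so the data here is invented):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'val': range(4)}, index=['0', '1', '10', '2'])  # string index: lexicographic order
pdf.index = pdf.index.astype(int)  # convert the index to numbers
pdf = pdf.sort_index()             # now ordered 0, 1, 2, 10
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf.divisions)               # partition boundaries follow the numeric order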

What is the best way to modify (e.g., perform math functions on) a column in a Dask DataFrame?

I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header; otherwise it adds a new column, using the index number as a string-type column name. Is there something akin to the pandas idiom of, say, df.iloc[:, -1] = df.iloc[:, -1] * -1 that I can use with a Dask dataframe?
Edit: I'm also trying to implement df = df.applymap(lambda x: x * -1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
first question
If something works for string columns and not for numeric-named columns then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
second question
but is there a way to apply a function over just one column?
You can't directly apply a single Python function over a dask dataframe that is stored in many pieces; however, methods like .map_partitions or .reduction may help you to achieve the same result with some cleverness.
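For example, a sketch using .map_partitions to negate only the last column (toy data; the column selection mirrors the question):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, -5.0, 6.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

def negate_last_column(part):
    # Each partition is an ordinary pandas DataFrame.
    part = part.copy()
    part[part.columns[-1]] *= -1
    return part

ddf = ddf.map_partitions(negate_last_column)
print(ddf.compute())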
In the future, we recommend asking separate questions separately on Stack Overflow.

Python Pandas: How can I get unique rows in my table based only on certain columns?

I have a df:
How can I remove duplicates while ignoring just one column? I have rows where all of the columns are the same except one; I want to ignore that column and get the unique rows based on the others.
This is what I tried, but I get an error on it:
data.drop_duplicates('asn','first_seen','incident_type','ip','uri')
Any idea?
What version of pandas are you running? I believe that since 0.14 you should provide a list of columns to drop_duplicates() using the subset keyword, so try:
data.drop_duplicates(subset=['asn','first_seen','incident_type','ip','uri'])
Also note that if you are not using inplace=True you will need to assign the returned value to a new dataframe.
Depending on your needs, you may also want to call reset_index() after dropping the duplicate rows.
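A quick sketch, assuming the column to ignore is called note (a hypothetical name; the real df isn't shown):

import pandas as pd

data = pd.DataFrame({
    'asn': [1, 1, 2],
    'first_seen': ['2016-01-01', '2016-01-01', '2016-02-01'],
    'incident_type': ['x', 'x', 'y'],
    'ip': ['1.2.3.4', '1.2.3.4', '5.6.7.8'],
    'uri': ['/a', '/a', '/b'],
    'note': ['first', 'differs only here', 'other'],  # ignored when deduplicating
})

# Keep the first row of each group of duplicates, judged only on the listed columns.
deduped = data.drop_duplicates(subset=['asn', 'first_seen', 'incident_type', 'ip', 'uri'])
deduped = deduped.reset_index(drop=True)  # optional: renumber the remaining rows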

pandas: how to merge multiple indexes together?

I have two levels of index, and I want to merge them into one level. I looked at methods such as reset_index and reindex, but they don't seem to be what I need. Another way I can think of is adding a new column containing the merged indexes, setting that column as the new index using pivot_table, and deleting the old indexes. But I'm wondering whether there is a more elegant way to do this. Any input is welcome. Thank you so much!
Doing
df.index = df.index.values
will give you tuples in a single level if that's what you mean by 'merge'.
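To make that concrete, and to show one way of joining the levels into flat strings instead of tuples (assuming the levels are string-convertible; the frame below is made up):

import pandas as pd

df = pd.DataFrame({'val': [1, 2]},
                  index=pd.MultiIndex.from_tuples([('a', 1), ('b', 2)], names=['k1', 'k2']))

df.index = df.index.values  # single level of tuples: ('a', 1), ('b', 2)

# Or join the levels into strings, e.g. 'a_1', 'b_2':
df.index = ['_'.join(map(str, t)) for t in df.index]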
