How can I specify row order when I use dask.dataframe - python

I have two dataframes with the same shape.
I tried to convert each to a dask dataframe, specifying the same npartitions=50.
However, the way each dataframe is split into partitions seems to differ, as shown in the image below.
Does anyone know how I can specify how a dataframe should be partitioned?

Here is a guess: the index values appear to be sorted in both, but one is sorted numerically and the other lexicographically; i.e., I suspect that your dataframe mrt_dask has an index containing strings, not numbers. If this is so, then calling astype on the index before passing it to dask should solve your issue, or perhaps you should change how the data is being loaded in the first place.
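For illustration, a minimal sketch of that fix (the frame, column names and index values here are made up):

import pandas as pd
import dask.dataframe as dd

# A string index sorts lexicographically, so "10" comes before "2"
pdf = pd.DataFrame({"x": [0, 1, 2]}, index=["1", "10", "2"])

# Convert the index to integers and re-sort before handing it to dask,
# so both frames end up partitioned along the same numeric index
pdf.index = pdf.index.astype(int)
pdf = pdf.sort_index()
ddf = dd.from_pandas(pdf, npartitions=2)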

Related

Mapping values from one df to another when dtypes are Objects

I have 2 dataframes that I am trying to map to each other: one I created from a csv file and the other from an Excel file. I am trying to map (something like a VLOOKUP in Excel) the name in the one df to the respective code in the other.
When I tried the same task with another dataframe whose 'key' column was of type integer, the process worked well. In this case, both columns are of type object, and I am able to map only one value; for the others I receive NaN. The one value that is successfully mapped contains two decimal points (for example 9.5.5), which I presume is why pandas treats the column as object and not integer.
I have attempted the following:
Changing the dtypes of both columns to strings and then trying to map them:
df_1['code'] = df_1['code'].astype(str)
df_2['code'] = df_2['code'].astype(str)
I adjusted the index of both so that the map function works with the index instead of columns:
df_1.set_index('code', inplace=True)
df_2.set_index('code', inplace=True)
The mapping was done using the following code:
df_1['code_name'] = df_1.index.map(df_2['code_name'])
Not too sure what else is possible. I cannot use the .apply or .applymap functions, as Python states that my data is a DataFrame. I also attempted to use .squeeze() to convert df_2 to a Series, and I still got the same results: NaN for the majority, with only certain values changed. If it helps, the values that were mapped are aligned to the left (when opened in Excel) while those unmapped with NaN are aligned to the right (perhaps seen as numbers).
If possible, I prefer to use the .map function as opposed to .merge, as it is faster.
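For what it's worth, the symptom described (only some keys mapping, the rest NaN) often comes from mixed types or stray whitespace in object key columns. A hypothetical sketch of normalizing both keys before mapping (the data here is made up):

import pandas as pd

# Hypothetical keys: mixed types and stray whitespace in object columns
df_1 = pd.DataFrame({"code": ["9.5.5", 82, " 7 "]})
df_2 = pd.DataFrame({"code": ["9.5.5", "82", "7"],
                     "code_name": ["alpha", "beta", "gamma"]})

# Normalize both keys to stripped strings so equal-looking values compare equal
df_1["code"] = df_1["code"].astype(str).str.strip()
df_2["code"] = df_2["code"].astype(str).str.strip()

df_1["code_name"] = df_1["code"].map(df_2.set_index("code")["code_name"])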

Problems with DataFrame indexing with pandas

Using pandas, I have to modify a DataFrame so that it only has the indexes that are also present in a vector, which was acquired by performing operations on one of the df's columns. Here's the specific line of code used for that (please do not mind me picking the name 'dataset' instead of 'dataframe' or 'df'):
dataset = dataset.iloc[list(set(dataset.index).intersection(set(vector.index)))]
It worked, and the image attached here shows the df and some of its indexes. However, when I try accessing a specific value by index in the new 'dataset', such as in the line shown below, I get an error: single positional indexer is out-of-bounds
print(dataset.iloc[:, 21612])
Note: I've also tried the following, to make sure it isn't simply an issue with me not knowing how to use iloc:
print(dataset.iloc[21612, :])
and
print(dataset.iloc[21612])
Do I have to create another column to "mimic" the actual indexes? What am I doing wrong? Please note that it's necessary for me to keep the indexes unchanged, despite the size of the DataFrame changing. E.g., if the DataFrame originally had 21000 rows and the new one only 15000, I still need to be able to use the number 20999 as an index if it passed the intersection check shown in the first code snippet. Thanks in advance
Try this:
print(dataset.loc[21612, :])
After you have eliminated some of the original rows, iloc[] indexes by position rather than by label, so its first (row) argument must not be greater than len(dataset) - 1. Use loc[] to look up rows by their original index labels.
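A minimal sketch of the difference (hypothetical data):

import pandas as pd

df = pd.DataFrame({"a": range(5)})   # index labels 0 through 4
df = df.loc[[0, 2, 4]]               # keep only the rows labelled 0, 2 and 4

print(df.loc[4])    # works: label-based lookup
print(df.iloc[4])   # IndexError: only 3 rows remain, so positions run 0..2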

Can I force Python to return only in String-format when I concatenate two series of strings?

I want to concatenate two columns in pandas containing mostly string values and some missing values. The result should be a new column which again contains string values and missings. Mostly it just worked fine with this:
df['newcolumn']=df['column1']+df['column2']
Most of the values in column1 are numbers (interpreted as strings), like 82. But some of the values in column2 are a composition of letters and numbers starting with an E, like E52 or E83. When 82 and E83 are concatenated, the result I want is 82E83. Unfortunately the result then is 8,2E+84. I guess Python implicitly interpreted this as a number in scientific notation.
I already tried different ways of concatenating and forcing string format, but the result is always the same:
df['newcolumn']=(df['column1']+df['column2']).astype(str)
or
df['newcolumn']=(df['column1'].str.cat(df['column2'])).astype(str)
It seems Python first creates a float, producing this unwanted format, and then changes the type to string, keeping results like 8,2E+84. Is there a solution for strictly keeping string format?
Edit: Thanks for your comments. When I tried to reproduce the problem myself with a very short dataframe, it didn't occur. Finally I realized that the issue was only Excel automatically interpreting the cells as (wrong) numbers in the CSV output. I hadn't noticed this before, because another dataframe coming from a CSV file, which I merged with this dataframe on the concatenated strings, had already been "destroyed" the same way by Excel. So the merge didn't work properly and I thought the concatenation in Python was the problem. I used to view the dataframe with Excel because it is really big. In the future I will be more careful with this. My apologies for misplacing the problem!
Type conversion is not required in this case. You can simply use
df["newcolumn"] = df.apply(lambda x: f"{x['column1']}{x['column2']}", axis=1)
(Using the column labels rather than positions like x[0] avoids deprecated positional access on the row Series; the f-string already renders each value as a string.)

What is the best way to modify (e.g., perform math functions on) a column in a Dask DataFrame?

I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header; otherwise it adds another column, using the index number as a string-type column name. Is there something akin to the Pandas method of, say, df.iloc[-1,:] = df.iloc[-1,:]*-1 that I can use with a Dask dataframe?
Edit: I'm also trying to implement df = df.applymap(lambda x: x*-1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
first question
If something works for string columns and not for numeric-named columns then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
second question
but is there a way to apply a function over just one column?
You can't directly apply a single Python function over a dask dataframe that is stored in many pieces; however, methods like .map_partitions or .reduction may help you to achieve the same result with some cleverness.
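A minimal sketch of the single-column case with .map_partitions (the data and names are hypothetical):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": [1.0, -2.0, 3.0], "b": [4.0, -5.0, 6.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

# map_partitions runs a pandas-level function on each piece of the column
ddf["b"] = ddf["b"].map_partitions(lambda s: s * -1)

print(ddf.compute())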
In the future, we recommend asking separate questions separately on Stack Overflow.

reset_index() to original column indices after pandas groupby()?

I generate a grouped dataframe with df = df.groupby(['X','Y']).max(), which I then want to write to CSV (without indexes). So I need to convert 'X' and 'Y' back into regular columns; I tried using reset_index(), but the order of the columns came out wrong.
How do I restore columns 'X' and 'Y' to their exact original column positions?
Is the solution:
df.reset_index(level=0, inplace=True)
and then find a way to change the order of the columns?
(I also found this approach, for multiindex)
This solution keeps 'X' and 'Y' as regular columns rather than moving them into the index when grouping, so there is no need for reset_index() or any column reordering afterwards:
df.groupby(['X','Y'], as_index=False).max()
(After testing a lot of different methods, the simplest one was the best solution, as always, and the one which eluded me the longest. Thanks to @maxymoo for pointing it out.)
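A minimal usage sketch (hypothetical data):

import pandas as pd

df = pd.DataFrame({"X": ["a", "a", "b"], "Y": [1, 1, 2], "Z": [10, 20, 30]})

# as_index=False leaves X and Y in their original column positions
out = df.groupby(["X", "Y"], as_index=False).max()
out.to_csv("out.csv", index=False)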
