I want to concatenate the multidimensional output of a NumPy computation back onto the input DataFrame; the output matches the input's shape with regard to rows and the respective selected columns.
But it fails with: NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.
I do not want to flatten the indices first - so is there another way to get it to work?
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({('metrik_0', Timestamp('2020-01-01 00:00:00')): {(1, 1): 2.5393693602911447, (1, 5): 4.316896324314225, (1, 6): 4.271001191238499, (1, 9): 2.8712588011247377, (1, 11): 4.0458495954752545}, ('metrik_0', Timestamp('2020-01-01 01:00:00')): {(1, 1): 4.02779063729038, (1, 5): 3.3849606155101224, (1, 6): 4.284114856052976, (1, 9): 3.980919941298365, (1, 11): 5.042488191587525}, ('metrik_0', Timestamp('2020-01-01 02:00:00')): {(1, 1): 2.374592085569529, (1, 5): 3.3405503781564487, (1, 6): 3.4049690284720366, (1, 9): 3.892686173978996, (1, 11): 2.1876998087043127}})
def compute_return_columns_to_df(df, colums_to_process, axis=0):
    method = 'compute_result'
    renamed_base_levels = map(lambda x: f'{x}_{method}', colums_to_process.get_level_values(0).unique())
    renamed_columns = colums_to_process.set_levels(renamed_base_levels, level=0)

    #####
    # perform calculation in numpy here
    # for the sake of simplicity (and as the actual computation is irrelevant - it is omitted in this minimal example)
    result = df[colums_to_process].values
    #####

    result = pd.DataFrame(result, columns=renamed_columns)
    display(result)
    # fails with: NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.
    # I do not want to flatten the indices first - so is there another way to get it to work?
    return pd.concat([df, result], axis=1)

compute_return_columns_to_df(df[df.columns[0:3]].head(), df.columns[0:2])
The reason your code fails lies in:
result = df[colums_to_process].values
result = pd.DataFrame(result, columns=renamed_columns)
Note that result has:
- column names with the top index level renamed to metrik_0_compute_result (so far OK),
- but a row index that is the default single-level index composed of consecutive numbers.
Then, when you concatenate df and result, Pandas attempts to align both source DataFrames on the row index, but they are incompatible (df has a MultiIndex, whereas result has an "ordinary" index).
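You can see the mismatch with a quick check inside the function, right after result is built (output shown for the df.head() example from the question):

print(result.index)  # RangeIndex(start=0, stop=5, step=1)
print(df.index)      # MultiIndex with entries like (1, 1), (1, 5), (1, 6), ...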
Change this part of your code to:
result = df[colums_to_process]
result.columns = renamed_columns
This way result keeps the original index and concat raises no
exception.
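Put together, a sketch of the whole function with this fix applied (keeping the question's identifiers; the .copy() is added here only as a precaution so the slice is independent of df):

def compute_return_columns_to_df(df, colums_to_process, axis=0):
    method = 'compute_result'
    renamed_base_levels = map(lambda x: f'{x}_{method}', colums_to_process.get_level_values(0).unique())
    renamed_columns = colums_to_process.set_levels(renamed_base_levels, level=0)
    # keep the DataFrame (and thus its MultiIndex rows) instead of dropping to .values
    result = df[colums_to_process].copy()
    result.columns = renamed_columns
    return pd.concat([df, result], axis=1)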
Another remark: your function has an axis parameter that is never used. Consider removing it.
Another possible approach
Since result has a default (single-level) index, you can also leave the previous part of the code as is and instead reset the index in df before concatenating:
return pd.concat([df.reset_index(drop=True), result], axis=1)
This way both DataFrames have the same (default) index and can be concatenated as well. Note, however, that with drop=True the original MultiIndex row labels are discarded from the result.
Related
I have a dataset that contains a column that contains a tuple in the form ('String', int).
I would like to drop all the rows that contain ('String1', 1), ('String2', 1), and ('String3', 1). I have tried many things but can't get it to drop.
I'm not sure what your data looks like, but it sounds like you can just filter out values equal to ('String', 1):
df = df[df['your column'] != ('String', 1)]
For multiple values:
df = df[~(df['your column'].str[0].isin(['String1', 'String2', 'String3']) & (df['your column'].str[1] == 1))]
(Note the parentheses around the == comparison: & binds more tightly than ==, so without them the expression is evaluated in the wrong order.)
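Alternatively, since the column holds the full tuples, you can list the exact pairs to exclude with isin (a sketch; 'your column' stands for the actual column name):

to_drop = [('String1', 1), ('String2', 1), ('String3', 1)]
df = df[~df['your column'].isin(to_drop)]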
I have a pandas dataframe for graph edges with a multi index as such
df = pd.DataFrame(index=[(1, 2), (2, 3), (3, 4), ...], data=['v1', 'v2', 'v3', ...])
However doing a simple .loc fails:
df.loc[(1, 2)] # error
df.loc[df.index[0]] # also error
with the message KeyError: 1. Why does it fail? The index clearly shows that the tuple (1, 2) is in it and in the docs I see .loc[] being used similarly.
Edit: Apparently df.loc[[(1, 2)]] works. Go figure. It was probably interpreting the first iterable as separate keys?
Turns out I needed to wrap the key in another iterable, such as a list, so that the whole tuple is used as the key instead of its individual elements, like so: df.loc[[(1, 2)]].
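For reference, the distinction on a minimal version of the example data (a sketch):

df = pd.DataFrame(index=[(1, 2), (2, 3), (3, 4)], data=['v1', 'v2', 'v3'])
df.loc[[(1, 2)]]   # works: the list makes pandas look up the whole tuple as one label
# df.loc[(1, 2)]   # KeyError: 1 -- the tuple's elements are treated as separate keys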
It seems the default for pd.read_csv() is to read in the column names as str. I can't find the behavior documented and thus can't find where to change it.
Is there a way to tell read_csv() to read in the column names as integer?
Or maybe the solution is specifying the datatype when calling pd.DataFrame.to_csv(). Either way, at the time of writing to csv, the column names are integers and that is not preserved on read.
The code I'm working with is loosely related to this (credit):
df = pd.DataFrame(index=pd.MultiIndex.from_arrays([[], []]))
for row_ind1 in range(3):
    for row_ind2 in range(3, 6):
        for col in range(6, 9):
            entry = row_ind1 * row_ind2 * col
            df.loc[(row_ind1, row_ind2), col] = entry
df.to_csv("df.csv")
dfr = pd.read_csv("df.csv", index_col=[0, 1])
print(dfr.loc[(0, 3), 6])    # KeyError
print(dfr.loc[(0, 3), "6"])  # No KeyError
My temporary solution is:
dfr.columns = dfr.columns.map(int)
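After that mapping, the integer-keyed lookup from the example works again (for the frame built above, the entry at index (0, 3) and column 6 is 0 * 3 * 6 = 0):

print(dfr.loc[(0, 3), 6])  # 0.0 -- no KeyError once the column labels are ints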
I'm trying to run a function over many partitions of a Dask dataframe. The code requires unpacking tuples and works well with Pandas but not with Dask's map_partitions. The data consists of lists of tuples: the lists can vary in length, but the tuples always have a known, fixed length.
import dask.dataframe as dd
import pandas as pd
def func(df):
    for index, row in df.iterrows():
        tuples = row['A']
        for t in tuples:
            x, y = t
            # Do more stuff
# Create Pandas dataframe
# Each list may have a different length, tuples have fixed known length
df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
# Pandas to Dask
ddf = dd.from_pandas(df, npartitions=2)
# Run function over Pandas dataframe
func(df)
# Run function over Dask dataframe
ddf.map_partitions(func).compute()
Here, the Pandas version runs with no issues, but the Dask one raises the error:
ValueError: Metadata inference failed in `func`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
ValueError('not enough values to unpack (expected 2, got 1)')
In my original function, I'm using these tuples only as auxiliary variables, and the data that is finally returned is completely different, so using meta doesn't fix the problem. How can I unpack the tuples?
When you use map_partitions without specifying meta, Dask will try to run the function on a small sample dataframe to infer what the output is. This can cause problems if your function is not compatible with that sample; you can inspect the sample dataframe with ddf._meta_nonempty (in this case its column contains the placeholder value foo).
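A quick check makes the failure mode visible (the exact representation may vary by Dask version):

print(ddf._meta_nonempty['A'].iloc[0])  # 'foo' -- a string, not a list of tuples
# iterating over 'foo' yields single characters, so `x, y = t` fails with
# "not enough values to unpack (expected 2, got 1)"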
An easy fix in this case is to provide meta explicitly; it's okay for the returned data to have a different format. For example, if each returned result is a list, you can pass meta=list:
import dask.dataframe as dd
import pandas as pd
def func(df):
    for index, row in df.iterrows():
        tuples = row['A']
        for t in tuples:
            x, y = t
    return [1, 2, 3]
df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
ddf = dd.from_pandas(df, npartitions=2)
ddf.map_partitions(func, meta=list).compute()
Another approach is to make your function compatible with the sample dataframe. The sample has an object column, but it contains the string foo rather than a list of tuples, so its elements cannot be unpacked into exactly two variables. Modifying your function to tolerate such values (with x, *y = t) makes it work:
import dask.dataframe as dd
import pandas as pd
def func(df):
    for index, row in df.iterrows():
        tuples = row['A']
        for t in tuples:
            x, *y = t
    return [1, 2, 3]

df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
ddf = dd.from_pandas(df, npartitions=2)
# notice that no meta is specified here
ddf.map_partitions(func).compute()
I have a Pandas DataFrame. How do I create a new column that acts as a row count of the DataFrame, given that I already made my index a Datetime?
For example, the following code is reproducible on your local PC:
import datetime
import numpy as np
import pandas as pd

dates = [
    datetime.date(2019, 1, 13),
    datetime.date(2020, 5, 11),
    datetime.date(2018, 7, 24),
    datetime.date(2019, 3, 23),
    datetime.date(2020, 2, 16)
]
data = {
    "a": [13.3, 12.3, np.nan, 10.3, np.nan],
    "b": [1, 0, 0, 1, 1],
    "c": ["no", "yes", "no", "", "yes"]
}
df = pd.DataFrame(index=dates, data=data)
Right now, I would like to add a new column as a count, something like 1, 2, 3, 4, 5 until the end of the data.
df['count'] = range(1, len(df) + 1)
len(df) returns the number of rows in the DataFrame, so you can call the builtin range function to create a range from 1 to the number of rows in the DataFrame, and then assign it to a new column. When assigning a range to a column, it is automatically converted to a pandas Series.
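For the example frame above (five rows), this produces the expected sequence:

print(df['count'].tolist())  # [1, 2, 3, 4, 5]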
You can build a Series using df.index and apply some processing to it before assigning it to a column of the dataframe.
Here, we could use:
df['count'] = pd.Series(1, index=df.index).cumsum()
Here it would be far less efficient (by more than an order of magnitude) than df['count'] = np.arange(1, 1 + len(df)), which directly builds a NumPy array with the expected values, but it can be useful in more complex use cases.