Pandas multi index loc - python

I have a pandas dataframe for graph edges with a multi index as such
df = pd.DataFrame(index=[(1, 2), (2, 3), (3, 4), ...], data=['v1', 'v2', 'v3', ...])
However doing a simple .loc fails:
df.loc[(1, 2)] # error
df.loc[df.index[0]] # also error
with the message KeyError: 1. Why does it fail? The index clearly shows that the tuple (1, 2) is in it and in the docs I see .loc[] being used similarly.
Edit: Apparently df.loc[[(1, 2)]] works. Go figure. It was probably interpreting the first iterable as separate keys?

Turns out I needed to wrap the key in another iterable like a list for it to use the whole tuple instead of its elements like so: df.loc[[(1, 2)]].
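The two spellings can be compared in a minimal sketch. It assumes the index really is a MultiIndex, built explicitly here with pd.MultiIndex.from_tuples; the column name `value` and the level names `src` and `dst` are made up for illustration. Passing the row key together with an explicit column slice leaves no doubt about whether the tuple is one row key or a (row, column) pair.

```python
import pandas as pd

# Hypothetical edge table: each row is keyed by a (source, target) pair
index = pd.MultiIndex.from_tuples([(1, 2), (2, 3), (3, 4)], names=["src", "dst"])
df = pd.DataFrame({"value": ["v1", "v2", "v3"]}, index=index)

# A list of keys: the tuple is taken as one complete row key, result is a DataFrame
row_as_frame = df.loc[[(1, 2)]]

# Row key plus an explicit column slice: result is a Series for that single row
row_as_series = df.loc[(1, 2), :]

print(row_as_frame)
print(row_as_series)
```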

Related

How can I extract a dictionary into Excel?

I'm trying to extract the following dictionary using a pandas data frame into Excel:
results = {'ZF_DTSPP': [735.0500558302846,678.5413714617252,772.0300704610595,722.254907241738,825.2955175305726], 'ZF_DTSPPG': [732.0500558302845,637.4786326591071,655.8462451037873,721.404907241738,821.8455175305724]}
This is my code:
df = pd.DataFrame(data=results, index=[5, 2])
df = (df.T)
print(df)
df.to_excel('dict1.xlsx')
Somehow I always receive following error:
"ValueError: Shape of passed values is (5, 2), indices imply (2, 2)".
What can I do? How do I need to adapt the index?
Is there a way to compare the different values of "ZF_DTSPP" and "ZF_DTSPPG" directly with python?
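The error occurs because index=[5, 2] is read as two row labels (5 and 2), while the data has five rows. You can drop the index argument and use pd.DataFrame.from_dict, as shown in pandas-from-dict; your code then becomes: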
df = pd.DataFrame.from_dict(results)
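A slightly fuller sketch of the same idea, which also covers the question about comparing the two series. The names `difference` and `is_larger` are illustrative, and the Excel export is left commented out because it assumes an Excel engine such as openpyxl is installed.

```python
import pandas as pd

results = {
    "ZF_DTSPP": [735.0500558302846, 678.5413714617252, 772.0300704610595,
                 722.254907241738, 825.2955175305726],
    "ZF_DTSPPG": [732.0500558302845, 637.4786326591071, 655.8462451037873,
                  721.404907241738, 821.8455175305724],
}

# orient="index" turns each dictionary key into a row, replacing the df.T step
df = pd.DataFrame.from_dict(results, orient="index")

# Element-wise comparison of the two rows
difference = df.loc["ZF_DTSPP"] - df.loc["ZF_DTSPPG"]
is_larger = df.loc["ZF_DTSPP"] > df.loc["ZF_DTSPPG"]

print(df)
print(difference)

# Assumes an Excel writer engine (e.g. openpyxl) is available:
# df.to_excel("dict1.xlsx")
```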

Unpack tuple inside function when using Dask map partitions

I'm trying to run a function over many partitions of a Dask dataframe. The code requires unpacking tuples and works well with Pandas but not with Dask map_partitions. The data corresponds to lists of tuples, where the length of the lists can vary, but the tuples are always of a known fixed length.
import dask.dataframe as dd
import pandas as pd
def func(df):
    for index, row in df.iterrows():
        tuples = row['A']
        for t in tuples:
            x, y = t
            # Do more stuff
# Create Pandas dataframe
# Each list may have a different length, tuples have fixed known length
df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
# Pandas to Dask
ddf = dd.from_pandas(df, npartitions=2)
# Run function over Pandas dataframe
func(df)
# Run function over Dask dataframe
ddf.map_partitions(func).compute()
Here, the Pandas version runs with no issues. However, the Dask one, raises the error:
ValueError: Metadata inference failed in `func`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
ValueError('not enough values to unpack (expected 2, got 1)')
In my original function, I'm using these tuples as auxiliary variables, and the data which is finally returned is completely different so using meta doesn't fix the problem. How can I unpack the tuples?
When you use map_partitions without specifying meta, Dask runs the function on a small sample dataframe to infer what the output looks like. This causes problems if your function is not compatible with that sample. You can inspect the sample with ddf._meta_nonempty (in this case it returns a column containing foo).
An easy fix in this case is to provide meta. It is fine for the returned data to have a different format, e.g. if each returned result is a list, you can provide meta=list:
import dask.dataframe as dd
import pandas as pd
def func(df):
    for index, row in df.iterrows():
        tuples = row['A']
        for t in tuples:
            x, y = t
    return [1,2,3]
df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
ddf = dd.from_pandas(df, npartitions=2)
ddf.map_partitions(func, meta=list).compute()
Another approach is to make your function compatible with the sample dataframe. The sample dataframe has an object column, but it contains the string foo rather than a list of tuples. Iterating over foo yields single characters, and a one-character string cannot be unpacked into x, y, which is where "expected 2, got 1" comes from. Modifying your function to accept such values (with x, *y = t) will make it work:
import dask.dataframe as dd
import pandas as pd
def func(df):
    for index, row in df.iterrows():
        tuples = row['A']
        for t in tuples:
            x, *y = t
    return [1,2,3]
df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
ddf = dd.from_pandas(df, npartitions=2)
# notice that no meta is specified here
ddf.map_partitions(func).compute()
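The unpacking failure can be reproduced without Dask. The sketch below assumes, as described above, that the placeholder cell is the string "foo"; the helper names `strict_unpack` and `tolerant_unpack` are made up for illustration.

```python
# A real cell from the question's dataframe, and the assumed placeholder cell
real_cell = [(1, 1), (3, 4)]
placeholder_cell = "foo"

def strict_unpack(cell):
    # Requires every element of the cell to have exactly two items
    return [(x, y) for x, y in cell]

def tolerant_unpack(cell):
    # Starred unpacking accepts any element with at least one item
    pairs = []
    for t in cell:
        x, *rest = t
        pairs.append((x, rest))
    return pairs

strict_result = strict_unpack(real_cell)

try:
    strict_unpack(placeholder_cell)
    error_message = ""
except ValueError as exc:
    # Each element of "foo" is a one-character string
    error_message = str(exc)

tolerant_result = tolerant_unpack(placeholder_cell)

print(strict_result)
print(error_message)
print(tolerant_result)
```

The starred form only keeps the function from crashing on the sample; the values it produces for the placeholder are meaningless and should not be relied on.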

Pandas concatenate Multiindex columns with same row index

I want to concatenate the multidimensional output of a NumPy computation matching in dimensions the shape of the input (with regards to rows and respective selected columns).
But it fails with: NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.
I do not want to flatten the indices first - so is there another way to get it to work?
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({('metrik_0', Timestamp('2020-01-01 00:00:00')): {(1, 1): 2.5393693602911447, (1, 5): 4.316896324314225, (1, 6): 4.271001191238499, (1, 9): 2.8712588011247377, (1, 11): 4.0458495954752545}, ('metrik_0', Timestamp('2020-01-01 01:00:00')): {(1, 1): 4.02779063729038, (1, 5): 3.3849606155101224, (1, 6): 4.284114856052976, (1, 9): 3.980919941298365, (1, 11): 5.042488191587525}, ('metrik_0', Timestamp('2020-01-01 02:00:00')): {(1, 1): 2.374592085569529, (1, 5): 3.3405503781564487, (1, 6): 3.4049690284720366, (1, 9): 3.892686173978996, (1, 11): 2.1876998087043127}})
def compute_return_columns_to_df(df, colums_to_process,axis=0):
    method = 'compute_result'
    renamed_base_levels = map(lambda x: f'{x}_{method}', colums_to_process.get_level_values(0).unique())
    renamed_columns = colums_to_process.set_levels(renamed_base_levels, level=0)
    #####
    # perform calculation in numpy here
    # for the sake of simplicity (and as the actual computation is irrelevant - it is omitted in this minimal example)
    result = df[colums_to_process].values
    #####
    result = pd.DataFrame(result, columns=renamed_columns)
    display(result)
    return pd.concat([df, result], axis=1) # fails with: NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.
    # I do not want to flatten the indices first - so is there another way to get it to work?
compute_return_columns_to_df(df[df.columns[0:3]].head(), df.columns[0:2])
The reason why your code failed is in:
result = df[colums_to_process].values
result = pd.DataFrame(result, columns=renamed_columns)
Note that result has:
column names with the top index level renamed to metrik_0_compute_result (so far OK),
but the row index is the default single-level index, composed of consecutive numbers.
Then, when you concatenate df and result, pandas attempts to align both source DataFrames on the row index, but they are incompatible (df has a MultiIndex, whereas result has an "ordinary" index).
Change this part of your code to:
result = df[colums_to_process]
result.columns = renamed_columns
This way result keeps the original index and concat raises no exception.
Another remark: your function has an axis parameter, which is never used. Consider removing it.
Another possible approach
Since result has a default (single-level) index, you can leave the previous part of the code as is, but reset the index in df before joining:
return pd.concat([df.reset_index(drop=True), result], axis=1)
This way both DataFrames have the same index and you can concatenate them as well.
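A third variant, sketched here under assumptions: keep the NumPy round trip but pass the original row index back when rebuilding the DataFrame. The small frame, the doubling used as a stand-in computation, and the way the renamed columns are built are all illustrative, not the asker's actual data or code.

```python
import pandas as pd

# Small stand-in for the question's frame: MultiIndex rows and MultiIndex columns
rows = pd.MultiIndex.from_tuples([(1, 1), (1, 5), (1, 6)])
cols = pd.MultiIndex.from_product(
    [["metrik_0"], pd.to_datetime(["2020-01-01 00:00", "2020-01-01 01:00"])]
)
df = pd.DataFrame([[2.5, 4.0], [4.3, 3.4], [4.3, 4.3]], index=rows, columns=cols)

# Placeholder for the real NumPy computation
values = df.to_numpy() * 2

# Rename the top column level, keeping the timestamps
renamed = pd.MultiIndex.from_tuples(
    [(f"{top}_compute_result", ts) for top, ts in df.columns]
)

# Reattach the original row index so both frames align during concat
result = pd.DataFrame(values, index=df.index, columns=renamed)
combined = pd.concat([df, result], axis=1)

print(combined)
```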

Pandas concat: why is the `DataFrame` with duplicated index not working with concat()?

Code example:
a = pd.DataFrame({"a": [1,2,3],}, index=[1,2,2])
b = pd.DataFrame({"b": [1,4,5],}, index=[1,4,5])
pd.concat([a, b], axis=1)
It raises error: ValueError: Shape of passed values is (7, 2), indices imply (5, 2)
What I expected as a result:
Why does it not return like this? concat's default joining is outer so I think my thought is reasonable enough... Am I missing something?
TL;DR: Why? I don't know for sure, but I think it comes down to the design of the package.
An index in pandas "is like an address, that’s how any data point across the dataframe or series can be accessed. Rows and columns both have indexes, rows indices are called as index and for columns its general column names." source
Here you are concatenating with axis=1, i.e. placing the frames side by side, so pandas has to align them on the row index. In a, the label 2 is an address that points to two different rows. We can still "access" these rows with a[a.index == 2]. Note, however, that the index is then no longer a function in the mathematical sense, because one label maps to two different values source. I am guessing the implementation assumes that each label identifies exactly one row, because that makes alignment easier to design.
Thus, when concatenating, pandas wants to match all the index labels where possible and fill in NaNs where not possible. However, as the error says, it thinks the shape implied by the indices is (5, 2), because the duplicated label is counted once even though it covers two rows. So why doesn't it work? I believe pandas works out the expected shape beforehand and then does the concatenation. To work out the shape it looks at the indices, and that check is where it breaks.
Note too that this also fails with duplicated column names:
a = pd.DataFrame({"a": [1,2,3], 'b': [9,8,7]}, index=[1,2,2])
b = pd.DataFrame({"b": [1,4,5], 'bx': [1,4,3]}, index=[1,4,5]).rename(columns={'bx': 'b'})
pd.concat([a,b]) # axis=0 is the default
ValueError: Plan shapes are not aligned
Therefore pd.concat needs unique labels along the axis it aligns on. You can't have duplicated column names when you concatenate row-wise, and likewise you can't have duplicated row labels when you concatenate column-wise.
Interestingly, for your original example, pd.concat([a, b], ignore_index=True, axis=1) also raises the same error, leading me to more strongly suspect that pandas is checking the shape before the concatenation.
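One possible workaround, not part of the answer above, is to use DataFrame.join instead of pd.concat. This is a sketch that assumes the expected result is an outer join on the row index in which the duplicated label 2 keeps both of its rows.

```python
import pandas as pd

a = pd.DataFrame({"a": [1, 2, 3]}, index=[1, 2, 2])
b = pd.DataFrame({"b": [1, 4, 5]}, index=[1, 4, 5])

# join aligns on the row index and accepts the duplicated label in a;
# labels missing on one side are filled with NaN
joined = a.join(b, how="outer")

print(joined)
```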

Python structured numpy array multiple sort

Hello all I have a list of delimiter separated strings:
lists=['1|Abra|23|43|0','2|Cadabra|15|18|0','3|Grabra|4|421|0','4|Lol|1|15|0']
I need to convert it to a numpy array and then sort it just like Excel does: first by column 3, then by column 2, and finally by the last column.
I've tried this:
def man():
    a = np.array(lists[0].split('|'))
    for line in lists:
        temp = np.array(line.split('|'),)
        a=np.concatenate((a, temp))
    a.sort(order=[0, 1])
man()
Of course no luck, because it is wrong! Unfortunately I'm not strong with numpy arrays. Can somebody help me please? :(
This works perfectly for me, but here numpy builds the array from a file, so to make it work I have to write my list of strings to a file, then read it back and convert it to an array:
import numpy as np
# let numpy guess the type with dtype=None
my_data = np.genfromtxt('Selector/tmp.txt',delimiter='|', dtype=None, names=["Num", "Date", "Desc", "Rgh" ,"Prc", "Color", "Smb", "MType"])
my_data.sort(order=["Color","Prc", "Rgh"])
# save specifying required format (tab separated values)
print(my_data)
How can I keep everything as is, but change the conversion so that it builds the same array from the list instead of from a file?
There may be better solutions, but for a start I would sort the array once by each column, in reverse order of priority.
I assume you want to sort by column 3, with ties resolved by column 2 and any remaining ties resolved by the last column. Thus, you'd actually sort by the last column first, then by 2, then by 3.
Furthermore, you can easily convert the list to an array using a list comprehension.
import numpy as np
lists=['1|Abra|23|43|0','2|Cadabra|15|18|0','3|Grabra|4|421|0','4|Lol|1|15|0']
# convert to numpy array by splitting each row
a = np.array([l.split('|') for l in lists])
# specify columns to sort by, in order
sort_cols = [3, 2, -1]
# sort by columns in reverse order.
# This only works correctly if the sorting algorithm is stable, hence kind='stable'.
# The fields are strings, so convert to int to compare numerically, not lexicographically.
for sc in sort_cols[::-1]:
    order = np.argsort(a[:, sc].astype(int), kind='stable')
    a = a[order]
print(a)
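The repeated stable sort can also be written as a single call to np.lexsort, which is a different function from the one used above. This sketch reuses the same 0-based column choice (3, then 2, then the last column) and assumes those columns hold integers. Note that np.lexsort treats the last key as the primary one, so the keys are listed from lowest to highest priority.

```python
import numpy as np

lists = ['1|Abra|23|43|0', '2|Cadabra|15|18|0', '3|Grabra|4|421|0', '4|Lol|1|15|0']

# Split each row into its fields; the result is a 2-D array of strings
a = np.array([line.split('|') for line in lists])

# Keys from lowest to highest priority; lexsort uses the LAST key as primary
keys = (a[:, -1].astype(int), a[:, 2].astype(int), a[:, 3].astype(int))
order = np.lexsort(keys)

a_sorted = a[order]
print(a_sorted)
```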
You can use a list comprehension to split your strings and convert the digit fields to int. Then create the numpy array with a proper structured dtype and call np.sort() with the desired field order (lists is the list from the question):
>>> dtype = [('1st', int), ('2nd', '|S7'), ('3rd', int), ('4th', int), ('5th', int)]
>>>
>>> a = np.array([tuple([int(i) if i.isdigit() else i for i in sub.split('|')]) for sub in lists], dtype=dtype)
>>> np.sort(a, axis=0, order=['3rd','2nd', '5th'])
array([(4, 'Lol', 1, 15, 0), (3, 'Grabra', 4, 421, 0),
       (2, 'Cadabra', 15, 18, 0), (1, 'Abra', 23, 43, 0)],
      dtype=[('1st', '<i8'), ('2nd', 'S7'), ('3rd', '<i8'), ('4th', '<i8'), ('5th', '<i8')])
You can also do this in plain Python, which can be faster for shorter data sets. You can simply use the sorted() function with a proper key function (lists is the list from the question):
from operator import itemgetter
# itemgetter uses 0-based positions: 2, 1, 4 are the 3rd, 2nd and 5th fields
sorted([[int(i) if i.isdigit() else i for i in sub.split('|')] for sub in lists], key=itemgetter(2, 1, 4))
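To stay closest to the file-based code in the question, np.genfromtxt can also read from a list of strings instead of a file name. This is a sketch: the field names below are invented to fit the five columns of the sample data, and encoding="utf-8" is passed so that the text column comes back as str.

```python
import numpy as np

lists = ['1|Abra|23|43|0', '2|Cadabra|15|18|0', '3|Grabra|4|421|0', '4|Lol|1|15|0']

# genfromtxt accepts a list of lines in place of a file name;
# dtype=None lets numpy guess a type per column
my_data = np.genfromtxt(
    lists,
    delimiter='|',
    dtype=None,
    names=["Num", "Name", "C3", "C4", "C5"],  # hypothetical field names
    encoding="utf-8",
)

# Sort by the third column, then the second, then the last
my_data.sort(order=["C3", "Name", "C5"])
print(my_data)
```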
