I have a bunch of pandas DataFrames, each a table of counts. I'd like to sum them up cellwise to return the totals, as if I were summing two-dimensional histograms or two-dimensional arrays. The output should be a table of the same dimensions as the inputs, but with the numerical values summed.
To make matters worse, the order of the columns may not be preserved.
There must be a cool way to do this without looping, but I can't figure it out.
Here's an example:
import pandas
df1 = pandas.DataFrame({'A': [3, 1, 2], 'B': [1, 1, 0]})
df2 = pandas.DataFrame({'B': [2, 0, 1], 'A': [4, 1, 6]})
I'm looking for a function something like:
df_cellwise_sum = cellwise_sum(df1, df2)
print(df_cellwise_sum)
which makes:
A B
0 7 3
1 2 1
2 8 1
Use DataFrame.add:
df = df1.add(df2)
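DataFrame.add aligns on both row and column labels before summing, so the swapped column order in df2 is handled automatically. A minimal sketch using the frames from the question (the fill_value=0 is only needed if a label could be missing from one of the frames; it replaces the resulting NaN with a plain sum):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [3, 1, 2], 'B': [1, 1, 0]})
df2 = pd.DataFrame({'B': [2, 0, 1], 'A': [4, 1, 6]})

# Labels are aligned before the addition, so the column order
# of df2 does not matter.
total = df1.add(df2, fill_value=0)
print(total)
```

Summing more than two frames can be done by chaining add calls, or with functools.reduce over the list of frames.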
Suppose I have two pandas data frames, one actually more like a series
import pandas as pd
import numpy as np
A = pd.DataFrame(index=[0, 1, 2], data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["I", "L", "P"])
B = pd.DataFrame(index=[1, 3, 4], data=[[10], [40], [70]])
I would like to add a new column to A, called "B", with values depending on the index: if an index element is shared by both A and B, then the value of B's row at that index should be used; otherwise 0. The result should look like this:
A = pd.DataFrame(index=[0, 1, 2], data=[[1, 2, 3, 0], [4, 5, 6, 10], [7, 8, 9, 0]], columns=["I", "L", "P", "B"])
A
How can this be achieved efficiently in Python / pandas?
reindex with assign
A['B'] = B[0].reindex(A.index,fill_value=0)
A
Out[55]:
I L P B
0 1 2 3 0
1 4 5 6 10
2 7 8 9 0
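As a self-contained sketch of the same approach (fill_value=0 is what keeps the indices missing from B at 0 instead of NaN):

```python
import pandas as pd

A = pd.DataFrame(index=[0, 1, 2], data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                 columns=["I", "L", "P"])
B = pd.DataFrame(index=[1, 3, 4], data=[[10], [40], [70]])

# Align B's single column to A's index; indices absent from B become 0.
A['B'] = B[0].reindex(A.index, fill_value=0)
print(A)
```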
I wish to extract values from a MultiIndex DataFrame; this df has two index levels, a_idx and b_idx. The values to be extracted are, e.g., (1, 1):
[in] df.loc[(1, 1), :]
[out] 0
Name: (1, 1), dtype: int64
which is as intended. But then if I want to obtain two values (1,2) and (2,3):
[in] df.loc[([1, 2], [2, 3]), :]
[out]
value
a_idx b_idx
1 2 1
3 6
2 2 3
3 9
This is not what I wanted: I need the specific pairs, not all 4 combinations.
Furthermore, I wish to select elements from this dataframe with two arrays, select_a and select_b, that have the same length as each other, but not the same length as the dataframe. So for
select_a = [1, 1, 2, 2, 3]
select_b = [1, 3, 2, 3, 1]
My guess was that I should do this using:
df.loc[(select_a, select_b), :]
and then receive a list of all items with a_idx == select_a[i] and b_idx == select_b[i] for all i in range(len(select_a)).
I have tried xs and slice indexing, but this did not return the desired results. My main reason for going to the indexing method is because of computational speed, as the real dataset is actually 4.3 million lines and the dataset that has to be created will have even more.
If this is not the best way to achieve this result, then please point me in the right direction. Any sources are also welcome; what I found in the pandas documentation was not geared towards this kind of indexing (or at least I have not been able to find it).
The dataframe is created using the following code:
numbers = pd.DataFrame(np.random.randint(0,10,10), columns=["value"])
numbers["a"] = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
numbers["b"] = [1, 2, 3, 4, 1, 2, 3, 1, 2, 3]
print("before adding the index to the dataframe")
print(numbers)
index_cols = pd.MultiIndex.from_arrays(
[numbers["a"].values, numbers["b"].values],
names=["a_idx", "b_idx"])
df = pd.DataFrame(numbers.values,
index=index_cols,
columns=numbers.columns.values)
df = df.sort_index()
df.drop(columns=["a","b"],inplace=True)
print("after adding the indexes to the dataframe")
print(df)
You were almost there. To get just those pairs, you need a list of tuples:
df.loc[[(1, 2), (2, 3)], :]
You can also do this using select_a and select_b. Just make sure that you pass the pairs to df.loc as tuples.
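For example, zipping the two selection arrays into (a_idx, b_idx) tuples (a sketch; the DataFrame construction below is a condensed version of the question's setup, and every pair must actually exist in the index or .loc raises a KeyError):

```python
import numpy as np
import pandas as pd

numbers = pd.DataFrame(np.random.randint(0, 10, 10), columns=["value"])
numbers["a"] = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
numbers["b"] = [1, 2, 3, 4, 1, 2, 3, 1, 2, 3]
df = numbers.set_index(["a", "b"]).rename_axis(["a_idx", "b_idx"]).sort_index()

select_a = [1, 1, 2, 2, 3]
select_b = [1, 3, 2, 3, 1]

# Pair the two arrays element-wise into index tuples, then look them
# all up in one .loc call.
pairs = list(zip(select_a, select_b))
result = df.loc[pairs, :]
print(result)
```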
So I have a list of arrays in Python: [[0, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1]]. I would like to make this list of arrays into a Pandas dataframe, with each array being a row. Is there a way to do this quickly and easily in Python? I tried values = np.split(values, len(values)) to split the list of arrays into multiple arrays (well, I tried). And then tried to create a dataframe with df = pd.DataFrame(values). But this is where my error came from. I got a "must pass 2-d input" error message. Any idea what I'm doing wrong and how to fix it? Or an easier way to go about this? Thanks!
No need to do all that splitting, etc. If you have it as a list of lists that is two dimensional (meaning all rows have the same number of elements), you can simply pass it to the DataFrame constructor:
data = [[0, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
pd.DataFrame(data)
generating the expected:
>>> pd.DataFrame(data)
0 1 2 3
0 0 1 0 1
1 1 0 1 1
2 0 1 1 1
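The same constructor also accepts a list of numpy arrays, and column names can be supplied if the default 0..3 labels aren't wanted (the c0..c3 names below are just illustrative):

```python
import numpy as np
import pandas as pd

rows = [np.array([0, 1, 0, 1]), np.array([1, 0, 1, 1]), np.array([0, 1, 1, 1])]

# Each array becomes one row; the column names are hypothetical placeholders.
df = pd.DataFrame(rows, columns=['c0', 'c1', 'c2', 'c3'])
print(df)
```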
I am relatively new to Python, and a piece of existing legacy code has created an object like the one below. I unfortunately cannot change it. The code creates many objects in the following format:
[[{'a': 2,'b': 3}],[{'a': 1,'c': 3}],[{'c': 2,'d': 4}]]
I am trying to transform this object into a matrix or numpy array. In this specific example it would have three rows (1, 2, 3) and four columns (a, b, c, d), with the dictionary values inserted in the cells. (I have included how this matrix would look as a toy example; I am not looking to recreate the table from scratch, but for code that translates the object above into matrix format.)
I am struggling to find a fast and easy way to do this. Any tips or advice much appreciated.
a b c d
1 2 3 0 0
2 1 0 3 0
3 0 2 0 4
I suspect you are focusing on the fast and easy when you first need to address the how. This isn't the normal input format for np.array or pandas, so let's focus on that.
It's a list of lists; suggesting a 2d array. But each sublist contains one dictionary, not a list of values.
In [633]: dd=[[{'a': 2,'b': 3}],[{'a': 1,'c': 3}],[{'c': 2,'d': 4}]]
In [634]: dd[0]
Out[634]: [{'b': 3, 'a': 2}]
So let's define a function that converts a dictionary into a list of numbers. We can address the question of where a,b,c,d labels come from, and whether you need to collect them from dd or not, later.
In [635]: dd[0][0]
Out[635]: {'b': 3, 'a': 2}
In [636]: def mk_row(adict):
     ...:     return [adict.get(k, 0) for k in ['a', 'b', 'c', 'd']]
In [637]: mk_row(dd[0][0])
Out[637]: [2, 3, 0, 0]
So now we just need to apply the function to each sublist
In [638]: [mk_row(d[0]) for d in dd]
Out[638]: [[2, 3, 0, 0], [1, 0, 3, 0], [0, 0, 2, 4]]
This is the kind of list that @Colin fed to pandas. It can also be given to np.array:
In [639]: np.array([mk_row(d[0]) for d in dd])
Out[639]:
array([[2, 3, 0, 0],
[1, 0, 3, 0],
[0, 0, 2, 4]])
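As for where the a, b, c, d labels come from: if they aren't known in advance, they can be collected from the data itself first. A sketch:

```python
import numpy as np

dd = [[{'a': 2, 'b': 3}], [{'a': 1, 'c': 3}], [{'c': 2, 'd': 4}]]

# Union of all keys across the sublists, sorted for a stable column order.
keys = sorted({k for sub in dd for k in sub[0]})

# Reuse the same get-with-default idea to build the rows.
arr = np.array([[sub[0].get(k, 0) for k in keys] for sub in dd])
print(keys)  # ['a', 'b', 'c', 'd']
print(arr)
```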
Simply use (DataFrame.from_items has since been removed from pandas, so this uses from_dict instead):
import pandas as pd
df = pd.DataFrame.from_dict({'1': [2, 3, 0, 0], '2': [1, 0, 3, 0], '3': [0, 2, 0, 4]},
                            orient='index', columns=['a', 'b', 'c', 'd'])
arr = df.values
You can then reference it like a normal numpy array:
print(arr[0,:])
The pandas factorize function assigns each unique value in a series to a sequential, 0-based index, and calculates which index each series entry belongs to.
I'd like to accomplish the equivalent of pandas.factorize on multiple columns:
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
pd.factorize(df)[0] # would like [0, 1, 2, 2, 1, 0]
That is, I want to determine each unique tuple of values in several columns of a data frame, assign a sequential index to each, and compute which index each row in the data frame belongs to.
Factorize only works on single columns. Is there a multi-column equivalent function in pandas?
You need to create an ndarray of tuples first; pandas.lib.fast_zip can do this very quickly in a Cython loop.
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
print(pd.factorize(pd.lib.fast_zip([df.x, df.y]))[0])
the output is:
[0 1 2 2 1 0]
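Note that pd.lib.fast_zip is gone from modern pandas. On recent versions, one equivalent is groupby(...).ngroup() with sort=False, which numbers each unique (x, y) pair in order of first appearance (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

# Each unique (x, y) combination gets a sequential 0-based id,
# assigned in order of first appearance.
codes = df.groupby(['x', 'y'], sort=False).ngroup()
print(codes.tolist())  # [0, 1, 2, 2, 1, 0]
```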
I am not sure if this is an efficient solution. There might be better solutions for this.
arr = []  # this will hold the unique rows of the dataframe
for i in df.index:
    if list(df.iloc[i]) not in arr:
        arr.append(list(df.iloc[i]))
so printing arr would give you
>>> print(arr)
[[1, 1], [1, 2], [2, 2]]
To hold the indices, I would declare an ind array:
ind = []
for i in df.index:
    ind.append(arr.index(list(df.iloc[i])))
printing ind would give
>>> print(ind)
[0, 1, 2, 2, 1, 0]
You can use drop_duplicates to drop those duplicated rows
In [23]: df.drop_duplicates()
Out[23]:
x y
0 1 1
1 1 2
2 2 2
EDIT
To achieve your goal, you can join your original df to the deduplicated one:
In [46]: df.join(df.drop_duplicates().reset_index().set_index(['x', 'y']), on=['x', 'y'])
Out[46]:
x y index
0 1 1 0
1 1 2 1
2 2 2 2
3 2 2 2
4 1 2 1
5 1 1 0
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
tuples = df[['x', 'y']].apply(tuple, axis=1)
df['newID'] = pd.factorize(tuples)[0]