I have two dataframes, each with many columns and rows. The rows hold the same elements, but they are indexed differently. I want to add the elements of one column of the first dataframe to the corresponding column of the second.
As a basic example consider the following two Series:
Sr1 = pd.Series([1,2,3,4], index = [0, 1, 2, 3])
Sr2 = pd.Series([3,4,-3,6], index = [1, 2, 3, 4])
Say that each position holds the same element, only under a different index. I want to add the two series element-wise and end up with a new series containing [4, 6, 0, 10]. Instead, because addition aligns on the indices, I get [nan, 5, 7, 1, nan].
Is there an easy way to solve this without changing the indices?
I want the output as a Series.
You could use reset_index(drop=True):
Sr1 = pd.Series([1,2,3,4], index = [0, 1, 2, 3])
Sr2 = pd.Series([3,4,-3,6], index = [1, 2, 3, 4])
Sr1 + Sr2.reset_index(drop=True)
0 4
1 6
2 0
3 10
dtype: int64
Alternatively, since both Series have the same length, you can add the underlying arrays and keep the first index:
pd.Series(Sr1.values + Sr2.values, index=Sr1.index)
Using zip:
import pandas as pd
Sr1 = pd.Series([1,2,3,4], index = [0, 1, 2, 3])
Sr2 = pd.Series([3,4,-3,6], index = [1, 2, 3, 4])
sr3 = [sum(i) for i in zip(Sr1, Sr2)]
print(sr3)
Output:
[4, 6, 0, 10]
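Since the question asks for a Series, you could wrap the resulting list back up, e.g.:
pd.Series(sr3)
which gives the same values with a fresh 0-based index.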
You could use .values, which gives you the underlying NumPy representation, and then add them like this:
Sr1.values + Sr2.values
Suppose I have two pandas DataFrames, one of which is really more like a Series:
import pandas as pd
import numpy as np
A = pd.DataFrame(index=[0, 1, 2], data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["I", "L", "P"])
B = pd.DataFrame(index=[1, 3, 4], data=[[10], [40], [70]])
I would like to add a new column to A, called "B", with values that depend on the index: if an index label is shared by both A and B, the value of B's row for that label should be used; otherwise 0. The result should look like this:
A = pd.DataFrame(index=[0, 1, 2], data=[[1, 2, 3, 0], [4, 5, 6, 10], [7, 8, 9, 0]], columns=["I", "L", "P", "B"])
A
How can this be achieved efficiently in Python / pandas?
reindex with assignment:
A['B'] = B[0].reindex(A.index, fill_value=0)
A
Out[55]:
I L P B
0 1 2 3 0
1 4 5 6 10
2 7 8 9 0
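For reference, the intermediate Series that reindex produces before the assignment looks like this; every label of A.index that is missing from B gets the fill value 0:
B[0].reindex(A.index, fill_value=0)
0     0
1    10
2     0
Name: 0, dtype: int64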
I haven't found a simple solution to move elements in a NumPy array.
Given an array, for example:
>>> A = np.arange(10).reshape(2,5)
>>> A
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
and given the indexes of the elements (columns in this case) to move, for example [2, 4], I want to move them to a certain position p and the places immediately after it, for example p = 1, shifting the other elements to the right. The result should be the following:
array([[0, 2, 4, 1, 3],
[5, 7, 9, 6, 8]])
You can create a mask m that encodes the sorting order. First we set the columns before p to -1, then the columns to be inserted to 0; the remaining columns stay at 1. The default sorting kind 'quicksort' is not stable, so to be safe we specify kind='stable' when using argsort to sort the mask, and build a new array from that order:
import numpy as np
A = np.arange(10).reshape(2,5)
p = 1
c = [2,4]
m = np.full(A.shape[1], 1)
m[:p] = -1 # leave up to position p as is
m[c] = 0 # insert columns c
print(A[:,m.argsort(kind='stable')])
#[[0 2 4 1 3]
# [5 7 9 6 8]]
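To see why this works on the example: with p = 1 and c = [2, 4], the mask is m = [-1, 1, 0, 1, 0], and the stable argsort of that mask yields the column order [0, 2, 4, 1, 3]: column 0 first (value -1), then the moved columns 2 and 4 (value 0), then the remaining columns 1 and 3 (value 1).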
I wish to extract values from a MultiIndex DataFrame. This df has two index levels, a_idx and b_idx. Extracting a single entry, e.g. (1, 1), works:
[in] df.loc[(1, 1), :]
[out]
value    0
Name: (1, 1), dtype: int64
which is as intended. But then if I want to obtain two values (1,2) and (2,3):
[in] df.loc[([1, 2], [2, 3]), :]
[out]
             value
a_idx b_idx
1     2          1
      3          6
2     2          3
      3          9
which is not what I wanted: I need the specific pairs, not all 4 combinations.
Furthermore, I wish to select elements from this dataframe with two arrays select_a and select_b that have the same length as each other, but not the same length as the dataframe. So for
select_a = [1, 1, 2, 2, 3]
select_b = [1, 3, 2, 3, 1]
My guess was that I should do this using:
df.loc[(select_a, select_b), :]
and then receive a list of all items with a_idx == select_a[i] and b_idx == select_b[i] for all i in range(len(select_a)).
I have tried xs and slice indexing, but these did not return the desired results. My main reason for going with the indexing method is computational speed, as the real dataset has 4.3 million rows and the dataset that has to be created will be even larger.
If this is not the best way to achieve the result, please point me in the right direction. Any sources are also welcome; what I found in the pandas documentation was not geared towards this kind of indexing (or at least I have not been able to find it).
The dataframe is created using the following code:
numbers = pd.DataFrame(np.random.randint(0,10,10), columns=["value"])
numbers["a"] = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
numbers["b"] = [1, 2, 3, 4, 1, 2, 3, 1, 2, 3]
print("before adding the index to the dataframe")
print(numbers)
index_cols = pd.MultiIndex.from_arrays(
    [numbers["a"].values, numbers["b"].values],
    names=["a_idx", "b_idx"])
df = pd.DataFrame(numbers.values,
                  index=index_cols,
                  columns=numbers.columns.values)
df = df.sort_index()
df.drop(columns=["a", "b"], inplace=True)
print("after adding the indexes to the dataframe")
print(df)
You were almost there. To get those specific pairs, you need syntax like this:
df.loc[[(1, 2), (2, 3)], :]
You can also do this using select_a and select_b. Just make sure that you pass the pairs to df.loc as a list of tuples.
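As a sketch, using the select_a and select_b from the question, you could zip the two arrays into (a_idx, b_idx) tuples and pass that list to .loc:
pairs = list(zip(select_a, select_b))  # [(1, 1), (1, 3), (2, 2), (2, 3), (3, 1)]
df.loc[pairs, :]
This returns one row per pair, in the order given, rather than the cross product of the two lists.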
I have a matrix, and I want to write a script that extracts the values which are bigger than zero, together with their row and column numbers (because each value belongs to a (row, column) position). Here's an example:
import numpy as np
m = np.array([[0, 2, 4], [4, 0, 4], [5, 4, 0]])
index_row = []
index_col = []
dist = []
I want to store the row numbers in index_row, the column numbers in index_col, and the values in dist. So in this case,
index_row = [0 0 1 1 2 2]
index_col = [1 2 0 2 0 1]
dist = [2 4 4 4 5 4]
How can I extend the code to achieve this goal? Thanks for giving me suggestions.
You can use numpy.where for this:
>>> indices = np.where(m > 0)
>>> index_row, index_col = indices
>>> dist = m[indices]
>>> index_row
array([0, 0, 1, 1, 2, 2])
>>> index_col
array([1, 2, 0, 2, 0, 1])
>>> dist
array([2, 4, 4, 4, 5, 4])
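If you would rather have the coordinates stacked as (row, col) pairs in a single array, np.argwhere gives that directly:
>>> np.argwhere(m > 0)
array([[0, 1],
       [0, 2],
       [1, 0],
       [1, 2],
       [2, 0],
       [2, 1]])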
Though this has been answered already, I often find np.where to be somewhat cumbersome, though like all things it depends on the circumstance. For this, I'd probably use zip and a list comprehension:
index_row = [0, 0, 1, 1, 2, 2]
index_col = [1, 2, 0, 2, 0, 1]
zipped = zip(index_row, index_col)
dist = [m[z] for z in zipped]
The zip will give you an iterable of tuples, which can be used to index NumPy arrays.
The pandas factorize function assigns each unique value in a series to a sequential, 0-based index, and calculates which index each series entry belongs to.
I'd like to accomplish the equivalent of pandas.factorize on multiple columns:
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
pd.factorize(df)[0] # would like [0, 1, 2, 2, 1, 0]
That is, I want to determine each unique tuple of values in several columns of a data frame, assign a sequential index to each, and compute which index each row in the data frame belongs to.
Factorize only works on single columns. Is there a multi-column equivalent function in pandas?
You need to create an ndarray of tuples first; pandas.lib.fast_zip can do this very quickly in a Cython loop.
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
print(pd.factorize(pd.lib.fast_zip([df.x.values, df.y.values]))[0])
the output is:
[0 1 2 2 1 0]
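Note that pd.lib.fast_zip has been removed from recent pandas releases. As a sketch with a modern version, groupby(...).ngroup() numbers each unique (x, y) pair in order of first appearance:
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
# label each row with the id of its (x, y) group, in order of appearance
codes = df.groupby(['x', 'y'], sort=False).ngroup()
print(codes.tolist())  # [0, 1, 2, 2, 1, 0]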
I am not sure if this is an efficient solution. There might be better solutions for this.
arr = []  # this will hold the unique rows of the dataframe
for i in df.index:
    if list(df.iloc[i]) not in arr:
        arr.append(list(df.iloc[i]))
so printing arr would give you
>>> print(arr)
[[1, 1], [1, 2], [2, 2]]
To hold the indices, I would declare an ind array:
ind = []
for i in df.index:
    ind.append(arr.index(list(df.iloc[i])))
printing ind would give
>>> print(ind)
[0, 1, 2, 2, 1, 0]
You can use drop_duplicates to drop those duplicated rows
In [23]: df.drop_duplicates()
Out[23]:
x y
0 1 1
1 1 2
2 2 2
EDIT
To achieve your goal, you can join your original df to the deduplicated one:
In [46]: df.join(df.drop_duplicates().reset_index().set_index(['x', 'y']), on=['x', 'y'])
Out[46]:
x y index
0 1 1 0
1 1 2 1
2 2 2 2
3 2 2 2
4 1 2 1
5 1 1 0
Another option is to build tuples from the columns and factorize those:
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
tuples = df[['x', 'y']].apply(tuple, axis=1)
df['newID'] = pd.factorize(tuples)[0]
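Since factorize numbers the tuples in order of first appearance, this should give df['newID'] equal to [0, 1, 2, 2, 1, 0], matching the requested output.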