The pandas factorize function assigns each unique value in a series to a sequential, 0-based index, and calculates which index each series entry belongs to.
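For example, on a single column (a quick sketch of the built-in behavior):
import pandas as pd
codes, uniques = pd.factorize(pd.Series(['b', 'b', 'a', 'c', 'b']))
print(codes)    # [0 0 1 2 0]
print(uniques)  # Index(['b', 'a', 'c'], dtype='object')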
I'd like to accomplish the equivalent of pandas.factorize on multiple columns:
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
pd.factorize(df)[0] # would like [0, 1, 2, 2, 1, 0]
That is, I want to determine each unique tuple of values in several columns of a data frame, assign a sequential index to each, and compute which index each row in the data frame belongs to.
Factorize only works on single columns. Is there a multi-column equivalent function in pandas?
You need to create an ndarray of tuples first; pandas.lib.fast_zip can do this very quickly in a Cython loop.
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
print(pd.factorize(pd.lib.fast_zip([df.x, df.y]))[0])
the output is:
[0 1 2 2 1 0]
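Note that pandas.lib.fast_zip was removed in later pandas releases. On a modern install, one equivalent (a sketch, assuming pandas >= 0.20, where groupby(...).ngroup() is available) is:
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
# number each unique (x, y) pair in order of first appearance
print(df.groupby(['x', 'y'], sort=False).ngroup().values)  # [0 1 2 2 1 0]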
I am not sure if this is an efficient solution. There might be better solutions for this.
arr = []  # this will hold the unique items of the dataframe
for i in df.index:
    if list(df.iloc[i]) not in arr:
        arr.append(list(df.iloc[i]))
so printing arr would give you
>>> print(arr)
[[1, 1], [1, 2], [2, 2]]
To hold the indices, I would declare an ind list:
ind = []
for i in df.index:
    ind.append(arr.index(list(df.iloc[i])))
printing ind would give
>>> print(ind)
[0, 1, 2, 2, 1, 0]
You can use drop_duplicates to drop the duplicated rows:
In [23]: df.drop_duplicates()
Out[23]:
x y
0 1 1
1 1 2
2 2 2
EDIT
To achieve your goal, you can join your original df to the deduplicated one:
In [46]: df.join(df.drop_duplicates().reset_index().set_index(['x', 'y']), on=['x', 'y'])
Out[46]:
x y index
0 1 1 0
1 1 2 1
2 2 2 2
3 2 2 2
4 1 2 1
5 1 1 0
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
tuples = df[['x', 'y']].apply(tuple, axis=1)
df['newID'] = pd.factorize(tuples)[0]
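On the example frame this should reproduce the desired codes:
print(df['newID'].values)  # [0 1 2 2 1 0]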
Related
Suppose I have two pandas data frames, one actually more like a series
import pandas as pd
import numpy as np
A = pd.DataFrame(index=[0, 1, 2], data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["I", "L", "P"])
B = pd.DataFrame(index=[1, 3, 4], data=[[10], [40], [70]])
I would like to add a new column to A, called "B", with values depending on the index. That is, if an index value is shared by both A and B, the value of B's row at that index should be used; otherwise 0. The result should look like this:
A = pd.DataFrame(index=[0, 1, 2], data=[[1, 2, 3, 0], [4, 5, 6, 10], [7, 8, 9, 0]], columns=["I", "L", "P", "B"])
A
How can this be achieved efficiently in Python / pandas?
Use reindex with a fill value:
A['B'] = B[0].reindex(A.index,fill_value=0)
A
Out[55]:
I L P B
0 1 2 3 0
1 4 5 6 10
2 7 8 9 0
I am no data scientist. I do know Python, and I currently have to manage time-series data that comes in at a regular interval. Much of this data consists of zeros, or of values that stay the same for a long time, and to save memory I'd like to filter those out. Is there some standard method for this (which I am obviously unaware of), or should I implement my own algorithm?
What I want to achieve is the following:
interval   value   result (summed)
1          0       0
2          0       # removed
3          0       0
4          1       1
5          2       2
6          2       # removed
7          2       # removed
8          2       2
9          0       0
10         0       0
Any help appreciated!
You could use pandas query on dataframes to achieve this:
import pandas as pd
matrix = [[1, 0, 0],
          [2, 0, 0],
          [3, 0, 0],
          [4, 1, 1],
          [5, 2, 2],
          [6, 2, 0],
          [7, 2, 0],
          [8, 2, 2],
          [9, 0, 0],
          [10, 0, 0]]
df = pd.DataFrame(matrix, columns=list('abc'))
print(df.query("c != 0"))
There is no quick function call to do what you need. The following is one way:
import pandas as pd
df = pd.DataFrame({'interval': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'value': [0, 0, 0, 1, 2, 2, 2, 2, 0, 0]})  # example dataframe
df['group'] = df['value'].ne(df['value'].shift()).cumsum() # column that increments every time the value changes
df['key'] = 1 # create column of ones
df['key'] = df.groupby('group')['key'].transform('cumsum') # get the cumulative sum
df['key'] = df.groupby('group')['key'].transform(lambda x: x.isin( [x.min(), x.max()])) # check which key is minimum and which is maximum by group
df = df[df['key']==True].drop(columns=['group', 'key']) # keep only relevant cases
df
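On the example data this keeps the first and last row of every run of repeated values:
   interval  value
0         1      0
2         3      0
3         4      1
4         5      2
7         8      2
8         9      0
9        10      0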
Here is the code:
l = [0, 0, 0, 1, 2, 2, 2, 2, 0, 0]
for i, ll in enumerate(l):
    if i != 0 and ll == l[i-1] and i < len(l) - 1 and ll == l[i+1]:
        continue
    print(i+1, ll)
It produces what you want. You haven't specified the format of your input data, so I assumed it is in a list. The conditions ll == l[i-1] and ll == l[i+1] are the key to skipping the repeated values.
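Running it on the example list prints:
1 0
3 0
4 1
5 2
8 2
9 0
10 0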
Thanks all!
Looking at the answers I guess I can conclude I'll need to roll my own. I'll be using your input as inspiration.
Thanks again!
I have a bunch of pandas.DataFrames, each of which is a table of counts. I'd like to sum them up to return the totals, i.e. cellwise, as if I were summing two-dimensional histograms or two-dimensional arrays. The output should be a table of the same dimensions as the input, but with the numerical values summed.
To make matters worse, the order of the columns may not be preserved.
There must be a cool way to do this without looping, but I can't figure it out.
Here's an example:
import pandas
df1 = pandas.DataFrame({'A': [3, 1, 2], 'B': [1, 1, 0]})
df2 = pandas.DataFrame({'B': [2, 0, 1], 'A': [4, 1, 6]})
I'm looking for a function something like:
df_cellwise_sum = cellwise_sum(df1, df2)
print(df_cellwise_sum)
which makes:
A B
0 7 3
1 2 1
2 8 1
Use DataFrame.add:
df = df1.add(df2)
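add aligns on row and column labels, so the swapped column order in df2 is handled automatically; passing fill_value=0 would additionally treat labels missing from one frame as zero, should that ever apply. On the example frames:
print(df1.add(df2))
#    A  B
# 0  7  3
# 1  2  1
# 2  8  1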
I have a DataFrame, let's say:
d = {'col1': [1, 2, 3], 'col2': [3, 4, 5]}  # that's what the data might look like
df = pd.DataFrame(data=d)
and I have a NumPy array [0, 2].
Now I want to add a column to the DataFrame that contains a 1 when the index of the row is in the NumPy array, and otherwise a 0.
Does anyone have an idea?
Use Index.isin and cast the mask to integers:
import numpy as np
import pandas as pd

d = {'col1': [1, 2, 3], 'col2': [3, 4, 5]}
df = pd.DataFrame(data=d)
a = np.array([0, 2])
df['new'] = df.index.isin(a).astype(int)
# alternative
# df['new'] = np.in1d(df.index, a).astype(int)
Or use numpy.where:
df['new'] = np.where(df.index.isin(a), 1, 0)
#alternative
#df['new'] = np.where(np.in1d(df.index, a), 1, 0)
print(df)
col1 col2 new
0 1 3 1
1 2 4 0
2 3 5 1
I have two dataframes, each with a lot of columns and rows. The elements in each row are the same, but their indexing is different. I want to add the elements of one of the columns of the two dataframes.
As a basic example consider the following two Series:
Sr1 = pd.Series([1,2,3,4], index = [0, 1, 2, 3])
Sr2 = pd.Series([3,4,-3,6], index = [1, 2, 3, 4])
Say that each row contains the same element, only with different indexing. I want to add the two columns and get a new column that contains [4, 6, 0, 10]. Instead, because of index alignment, I get [NaN, 5.0, 7.0, 1.0, NaN].
Is there an easy way to solve this without changing the indices?
I want output as a series.
You could use reset_index(drop=True):
Sr1 = pd.Series([1,2,3,4], index = [0, 1, 2, 3])
Sr2 = pd.Series([3,4,-3,6], index = [1, 2, 3, 4])
Sr1 + Sr2.reset_index(drop=True)
0 4
1 6
2 0
3 10
dtype: int64
Also,
pd.Series(Sr1.values + Sr2.values, index=Sr1.index)
Using zip
Ex:
import pandas as pd
Sr1 = pd.Series([1,2,3,4], index = [0, 1, 2, 3])
Sr2 = pd.Series([3,4,-3,6], index = [1, 2, 3, 4])
sr3 = [sum(i) for i in zip(Sr1, Sr2)]
print(sr3)
Output:
[4, 6, 0, 10]
You could use .values, which gives you a NumPy representation of each series, and then add them like this:
Sr1.values + Sr2.values
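Note that this returns a plain NumPy array rather than a Series, and it assumes both series have the same length; to get a Series back, as the question asks, wrap the sum as in the earlier answer, e.g. pd.Series(Sr1.values + Sr2.values, index=Sr1.index).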