I am trying to iterate over a time series with multiple columns and check whether the values within each column are monotonic_increasing or decreasing.
The underlying issue is that I don't know how to iterate over the DataFrame columns and treat each one's values as a Series so that is_monotonic_increasing works.
I have a dataset that looks like this:
Id 10000T 20000T
2020-04-30 0 7
2020-05-31 3 5
2020-06-30 5 6
and I have tried doing this:
trend_observation_period = new_df[-3:] #the dataset
trend = np.where((trend_observation_period.is_monotonic_increasing()==True), 'go', 'nogo')
which gives me the error:
AttributeError: 'DataFrame' object has no attribute 'is_monotonic_increasing'
I am confused because I thought that np.where would iterate over the columns and read them as NumPy arrays. I have also tried this, which does not work either.
for i in trend_observation_period.iteritems():
    s = pd.Series(i)
    trend = np.where((s.is_monotonic_increasing()==True | s.is_monotonic_decreasing()==True),
                     'trending', 'not_trending')
It sounds like you're after something that will iterate over the columns and test whether each column is monotonic. See if this puts you on the right track.
Per the pandas docs, .is_monotonic is the same as .is_monotonic_increasing. (Note: is_monotonic was deprecated in pandas 1.5 and removed in 2.0, so the example below uses is_monotonic_increasing directly.)
Example:
# Sample dataset setup.
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2],
                   'b': [3, 2, 1, 0],
                   'c': [0, 1, 1, 0],
                   'd': [2, 0, 1, 0]})

# Loop through each column in the DataFrame and report whether it is monotonic.
for c in df:
    print(f'Column: {c} I', df[c].is_monotonic_increasing)
    print(f'Column: {c} D', df[c].is_monotonic_decreasing, end='\n\n')
Output:
Column: a I True
Column: a D False
Column: b I False
Column: b D True
Column: c I False
Column: c D False
Column: d I False
Column: d D False
You can use DataFrame.apply to apply a function to each of your columns. Since is_monotonic_increasing is a property of a Series and not a method of it, you'll need to wrap it in a function (you can use lambda for this):
df = pd.DataFrame({'a': [1, 1, 1, 1],
'b': [1, 1, 1, 0],
'c': [0, 1, 1, 0],
'd': [0, 0, 0, 0]})
increasing_cols = df.apply(lambda s: s.is_monotonic_increasing)
print(increasing_cols)
a True
b False
c False
d True
dtype: bool
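To tie this back to the original question's labels, the two direction checks can be combined per column and passed to np.where. A minimal sketch (the 'trending'/'not_trending' labels mirror the question's second attempt; the column data here is made up):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 1, 1, 2],
                   'b': [3, 2, 1, 0],
                   'c': [0, 1, 1, 0]})

# True where a column is monotonic in either direction.
monotonic = df.apply(lambda s: s.is_monotonic_increasing or s.is_monotonic_decreasing)

# Map the booleans to labels, keeping the column names as the index.
trend = pd.Series(np.where(monotonic, 'trending', 'not_trending'), index=monotonic.index)
print(trend)
# a        trending
# b        trending
# c    not_trending
# dtype: object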
Use .apply and is_monotonic_increasing (.is_monotonic in older pandas versions).
Example:
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [0, 1, 0, 1],
                   "C": [3, 5, 8, 9],
                   "D": [1, 2, 2, 1]})
df.apply(lambda x: x.is_monotonic_increasing)
A True
B False
C True
D False
dtype: bool
Related
I have a DataFrame, let's say:
d = {'col1': [1, 2, 3], 'col2': [3, 4, 5]}  # that's what the data might look like
df = pd.DataFrame(data=d)
and I have a NumPy array [0, 2].
Now I want to add a column to the DataFrame, where there is a 1, when the index of the row is in the np array, otherwise a 0.
Does anyone have an idea?
Use Index.isin and cast the boolean mask to integers:
import numpy as np
import pandas as pd

d = {'col1': [1, 2, 3], 'col2': [3, 4, 5]}
df = pd.DataFrame(data=d)
a = np.array([0, 2])

df['new'] = df.index.isin(a).astype(int)
#alternative
#df['new'] = np.in1d(df.index, a).astype(int)
Or use numpy.where:
df['new'] = np.where(df.index.isin(a), 1, 0)
#alternative
#df['new'] = np.where(np.in1d(df.index, a), 1, 0)
print(df)
col1 col2 new
0 1 3 1
1 2 4 0
2 3 5 1
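As a side note, newer NumPy releases recommend np.isin over np.in1d (which is deprecated); it is a drop-in replacement here:

df['new'] = np.isin(df.index, a).astype(int)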
I would like to merge two dataframes. Both have the same column names, but different numbers of rows.
The values from the smaller dataframe should then replace the values from the other dataframe
So far I tried using pd.merge
pd.merge(df1, df2, how='left', on='NodeID')
But I do not know how to tell the merge command to use the values from the right dataframe for the columns 'X' and 'Y'.
df1 = pd.DataFrame(data={'NodeID': [1, 2, 3, 4, 5], 'X': [0, 0, 0, 0, 0], 'Y': [0, 0, 0, 0, 0]})
df2 = pd.DataFrame(data={'NodeID': [2, 4], 'X': [1, 1], 'Y': [1, 1]})
The result should then look like this:
df3 = pd.DataFrame(data={'NodeID': [1, 2, 3, 4, 5], 'X': [0, 1, 0, 1, 0], 'Y':[0, 1, 0, 1, 0]})
This can be done with concat and drop_duplicates:
pd.concat([df2, df1]).drop_duplicates('NodeID').sort_values('NodeID')
Out[763]:
NodeID X Y
0 1 0 0
0 2 1 1
2 3 0 0
1 4 1 1
4 5 0 0
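An alternative not in the original answer is DataFrame.update, which aligns rows on the index; a sketch (update may upcast the integer columns to float during its NaN-based alignment, hence the astype at the end):

df3 = df1.set_index('NodeID')
df3.update(df2.set_index('NodeID'))
df3 = df3.astype(int).reset_index()
print(df3)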
Given a data frame df:
Column A: [0, 1, 3, 4, 6]
Column B: [0, 0, 0, 0, 0]
The goal is to conditionally replace values in column B. If a value in column A exists in a set assignedToA, we replace the corresponding value in column B with a constant b.
For example: if b=1 and assignedToA={1,4}, the result would be
Column A: [0, 1, 3, 4, 6]
Column B: [0, 1, 0, 1, 0]
My code for finding the A values and writing the B values into them looks like this:
df.loc[df['A'].isin(assignedToA),'B']=b
This code works, but it is really slow for a huge dataframe.
Do you have any advice, how to speed this process up?
The dataframe df has around 5 Million rows and assignedToA has a maximum of 7 values.
You may find a performance improvement by dropping down to numpy:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 3, 4, 6],
                   'B': [0, 0, 0, 0, 0]})

def jp(df, vals, k):
    B = df['B'].values
    B[np.in1d(df['A'], list(vals))] = k
    df['B'] = B
    return df

def original(df, vals, k):
    df.loc[df['A'].isin(vals), 'B'] = k
    return df
df = pd.concat([df]*100000)
%timeit jp(df, {1, 4}, 1) # 8.55ms
%timeit original(df, {1, 4}, 1) # 16.6ms
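On newer NumPy versions the same idea can be spelled with np.isin, the successor to np.in1d; a sketch of the variant (timings will of course vary by machine and library versions):

def jp_isin(df, vals, k):
    # Same approach as jp above, with np.isin instead of np.in1d.
    B = df['B'].to_numpy()
    B[np.isin(df['A'].to_numpy(), list(vals))] = k
    df['B'] = B
    return df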
I have two dataframes, each one having a lot of columns and rows. The elements in each row are the same, but their indexing is different. I want to add the elements of one of the columns of the two dataframes.
As a basic example consider the following two Series:
Sr1 = pd.Series([1,2,3,4], index = [0, 1, 2, 3])
Sr2 = pd.Series([3,4,-3,6], index = [1, 2, 3, 4])
Say that each row contains the same element, just with a different index. I want to add the two columns and get a new column containing [4, 6, 0, 10]. Instead, due to index alignment, I get [nan, 5, 7, 1, nan].
Is there an easy way to solve this without changing the indices?
I want output as a series.
You could use reset_index(drop=True):
Sr1 = pd.Series([1,2,3,4], index = [0, 1, 2, 3])
Sr2 = pd.Series([3,4,-3,6], index = [1, 2, 3, 4])
Sr1 + Sr2.reset_index(drop=True)
0 4
1 6
2 0
3 10
dtype: int64
Also,
pd.Series(Sr1.values + Sr2.values, index=Sr1.index)
Using zip:
Example:
import pandas as pd
Sr1 = pd.Series([1,2,3,4], index = [0, 1, 2, 3])
Sr2 = pd.Series([3,4,-3,6], index = [1, 2, 3, 4])
sr3 = [sum(i) for i in zip(Sr1, Sr2)]
print(sr3)
Output:
[4, 6, 0, 10]
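Since the question asks for a Series as output, the list can be wrapped back up; a small sketch that reuses Sr1's index (an arbitrary choice):

sr3 = pd.Series([a + b for a, b in zip(Sr1, Sr2)], index=Sr1.index)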
You could use .values, which gives you the underlying NumPy representation, and then add them like this:
Sr1.values + Sr2.values
The pandas factorize function assigns each unique value in a series to a sequential, 0-based index, and calculates which index each series entry belongs to.
I'd like to accomplish the equivalent of pandas.factorize on multiple columns:
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
pd.factorize(df)[0] # would like [0, 1, 2, 2, 1, 0]
That is, I want to determine each unique tuple of values in several columns of a data frame, assign a sequential index to each, and compute which index each row in the data frame belongs to.
Factorize only works on single columns. Is there a multi-column equivalent function in pandas?
You need to create an ndarray of tuples first; pandas.lib.fast_zip can do this very fast in a Cython loop. (Note: pd.lib.fast_zip comes from older pandas versions and has since been removed.)
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
print(pd.factorize(pd.lib.fast_zip([df.x, df.y]))[0])
the output is:
[0 1 2 2 1 0]
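Because pd.lib.fast_zip is gone from modern pandas, one current alternative (a sketch, not part of the original answer) is groupby(...).ngroup():

import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

# sort=False numbers each unique (x, y) pair in order of first appearance.
codes = df.groupby(['x', 'y'], sort=False).ngroup()
print(codes.tolist())  # [0, 1, 2, 2, 1, 0]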
I am not sure if this is an efficient solution. There might be better solutions for this.
arr = []  # this will hold the unique items of the dataframe
for i in df.index:
    if list(df.iloc[i]) not in arr:
        arr.append(list(df.iloc[i]))
so printing arr would give you
>>> print(arr)
[[1, 1], [1, 2], [2, 2]]
To hold the indices, I would declare an ind array:
ind = []
for i in df.index:
    ind.append(arr.index(list(df.iloc[i])))
printing ind would give
>>> print(ind)
[0, 1, 2, 2, 1, 0]
You can use drop_duplicates to drop the duplicated rows:
In [23]: df.drop_duplicates()
Out[23]:
x y
0 1 1
1 1 2
2 2 2
EDIT
To achieve your goal, you can join your original df to the deduplicated one:
In [46]: df.join(df.drop_duplicates().reset_index().set_index(['x', 'y']), on=['x', 'y'])
Out[46]:
x y index
0 1 1 0
1 1 2 1
2 2 2 2
3 2 2 2
4 1 2 1
5 1 1 0
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
tuples = df[['x', 'y']].apply(tuple, axis=1)
df['newID'] = pd.factorize(tuples)[0]
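A note on the last snippet: apply(tuple, axis=1) makes a Python-level call per row, which can be slow on large frames. A sketch of a usually faster variant that builds the tuples with zip (an alternative I'm adding, not from the original answer):

tuples = pd.Series(list(zip(df['x'], df['y'])))
df['newID'] = pd.factorize(tuples)[0]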