Suppose we have a NumPy 2D array (or a pandas DataFrame) with arbitrary numbers of rows and columns.
Is there a quick way to inspect all elements and clip any that exceed a pre-specified max value, in either a NumPy ndarray or a pandas DataFrame, whichever is simpler?
pandas - use DataFrame.clip_upper:
import numpy as np
import pandas as pd

np.random.seed(2018)
df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
print(df)
   0  1  2  3  4
0  6  2  9  5  4
1  6  9  9  7  9
2  6  6  1  0  6
3  5  6  7  0  7
4  8  7  9  4  8
print(df.clip_upper(5))
   0  1  2  3  4
0  5  2  5  5  4
1  5  5  5  5  5
2  5  5  1  0  5
3  5  5  5  0  5
4  5  5  5  4  5
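Note that clip_upper was deprecated in pandas 0.24 and removed in pandas 1.0; on current versions the equivalent call is DataFrame.clip with only the upper keyword:
print(df.clip(upper=5))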
Numpy - use numpy.clip:
np.random.seed(2018)
arr = np.random.randint(10, size=(5, 5))
print(arr)
[[6 2 9 5 4]
 [6 9 9 7 9]
 [6 6 1 0 6]
 [5 6 7 0 7]
 [8 7 9 4 8]]
print(np.clip(arr, arr.min(), 5))
[[5 2 5 5 4]
 [5 5 5 5 5]
 [5 5 1 0 5]
 [5 5 5 0 5]
 [5 5 5 4 5]]
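If you only need an upper bound, you can also pass None as the lower bound rather than computing arr.min(); a minimal equivalent:
print(np.clip(arr, None, 5))  # no lower bound, just cap values at 5
np.minimum(arr, 5) is another common spelling of the same operation.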
I have a small subset of data here:
import pandas as pd
days = [1, 2, 3]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
df1 = pd.DataFrame(days)
df2 = pd.Series(time)
df2 = df2.transpose()
df3 = df1*df2
df1 is a column of data and df2 is a row of data. I need a 3x9 DataFrame where each value in the column multiplies the entire row.
The end result should look like:
df3 = [2  4  2  4  2  4  2  4  2
       4  8  4  8  4  8  4  8  4
       6 12  6 12  6 12  6 12  6]
The way I currently have it for my larger dataset, only a few data points are multiplied correctly and most come out as NaNs.
The dot product is one solution to this problem:
import pandas as pd
days = [1, 2, 3]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
df1 = pd.DataFrame(days)
df2 = pd.DataFrame(time)
# use dot
df3 = df1.dot(df2.T)
df3
Output:
   0   1  2   3  4   5  6   7  8
0  2   4  2   4  2   4  2   4  2
1  4   8  4   8  4   8  4   8  4
2  6  12  6  12  6  12  6  12  6
Try this:
df1.dot(df2.to_frame().T)
Output:
   0   1  2   3  4   5  6   7  8
0  2   4  2   4  2   4  2   4  2
1  4   8  4   8  4   8  4   8  4
2  6  12  6  12  6  12  6  12  6
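If you don't need pandas index alignment, NumPy's outer product builds the same 3x9 grid directly; a minimal sketch reusing the question's lists:
import numpy as np
import pandas as pd

days = [1, 2, 3]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
df3 = pd.DataFrame(np.outer(days, time))  # row i is days[i] * time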
For example, I have a 2D array:
1 2 3
4 5 6
7 8 9
And I want it to become:
1 1 2 3 3
1 1 2 3 3
4 4 5 6 6
7 7 8 9 9
7 7 8 9 9
And then loop this process until the size becomes a 9x9 2D array.
Thanks!
You're looking for numpy's repeat function:
import numpy as np

initial_array = np.arange(1, 10).reshape((3, 3))
desired_shape = (9, 9)

number_of_repeat_axis0 = desired_shape[0] // initial_array.shape[0]
number_of_repeat_axis1 = desired_shape[1] // initial_array.shape[1]

tmp = np.repeat(initial_array, number_of_repeat_axis0, axis=0)  # repeat rows
output = np.repeat(tmp, number_of_repeat_axis1, axis=1)         # repeat columns
'''
returns:
[[1 1 1 2 2 2 3 3 3]
 [1 1 1 2 2 2 3 3 3]
 [1 1 1 2 2 2 3 3 3]
 [4 4 4 5 5 5 6 6 6]
 [4 4 4 5 5 5 6 6 6]
 [4 4 4 5 5 5 6 6 6]
 [7 7 7 8 8 8 9 9 9]
 [7 7 7 8 8 8 9 9 9]
 [7 7 7 8 8 8 9 9 9]]
'''
But this will repeat all of your data, including the values in the middle of the array. If you only want the edge values repeated, simply change it to:
tmp = np.repeat(initial_array, [4, 1, 4], axis=0)
output = np.repeat(tmp, [4, 1, 4], axis=1)
'''
returns:
[[1 1 1 1 2 3 3 3 3]
 [1 1 1 1 2 3 3 3 3]
 [1 1 1 1 2 3 3 3 3]
 [1 1 1 1 2 3 3 3 3]
 [4 4 4 4 5 6 6 6 6]
 [7 7 7 7 8 9 9 9 9]
 [7 7 7 7 8 9 9 9 9]
 [7 7 7 7 8 9 9 9 9]
 [7 7 7 7 8 9 9 9 9]]
'''
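If you literally want to loop as the question describes (3x3 -> 5x5 -> 7x7 -> 9x9, duplicating only the border on each pass), a sketch along those lines, which ends at the same result as the [4, 1, 4] call above:
import numpy as np

arr = np.arange(1, 10).reshape(3, 3)
while arr.shape[0] < 9:
    # repeat the first and last row/column twice, everything in between once
    counts = [2] + [1] * (arr.shape[0] - 2) + [2]
    arr = np.repeat(np.repeat(arr, counts, axis=0), counts, axis=1)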
I want to swap all the values of my data frame. The largest value must be replaced with the smallest value (i.e. 7 with 1, 6 with 2, 5 with 3, 4 with 4, 3 with 5, and so on).
import numpy as np
import pandas as pd
import io
data = '''
Values
6
1
3
7
5
2
4
1
4
7
2
5
'''
df = pd.read_csv(io.StringIO(data))
Trial
First I want to get all the unique values from my data.
df1 = df.Values.unique()
print(df1)
[6 1 3 7 5 2 4]
I have sorted it in ascending order:
sorted1 = list(np.sort(df1))
print(sorted1)
[1, 2, 3, 4, 5, 6, 7]
Then I reverse-sorted the list:
rev_sorted = list(reversed(sorted1))
print(rev_sorted)
[7, 6, 5, 4, 3, 2, 1]
Now I need to replace the max value with the min value, and so on, in my main data set (df). The old values can be replaced, or a new column can be added.
Expected Output:
Values,New_Values
6,2
1,7
3,5
7,1
5,3
2,6
4,4
1,7
4,4
7,1
2,6
5,3
Here's a vectorized one -
In [51]: m,n = np.unique(df['Values'], return_inverse=True)
In [52]: df['New_Values'] = m[n.max()-n]
In [53]: df
Out[53]:
    Values  New_Values
0        6           2
1        1           7
2        3           5
3        7           1
4        5           3
5        2           6
6        4           4
7        1           7
8        4           4
9        7           1
10       2           6
11       5           3
Translating to pandas with pandas.factorize -
m, n = pd.factorize(df.Values, sort=True)  # note: factorize returns (codes, uniques)
df['New_Values'] = n[m.max() - m]          # reversed codes index into the sorted uniques
Use Series.map with a dictionary built from the sorted and reverse-sorted lists:
df['New'] = df['Values'].map(dict(zip(sorted1, rev_sorted)))
print(df)
    Values  New
0        6    2
1        1    7
2        3    5
3        7    1
4        5    3
5        2    6
6        4    4
7        1    7
8        4    4
9        7    1
10       2    6
11       5    3
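A dense-rank variant produces the same mapping without building the dictionary explicitly; a sketch using the df defined above (desc and r are illustrative names):
desc = np.sort(df['Values'].unique())[::-1]            # unique values, largest first
r = df['Values'].rank(method='dense').astype(int) - 1  # 0-based dense rank
df['New'] = desc[r.to_numpy()]                         # highest rank maps to smallest value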
Suppose I have a dataframe looking something like
df = pd.DataFrame(np.array([[1, 2, 3, 2], [4, 5, 6, 3], [7, 8, 9, 5]]), columns=['a', 'b', 'c', 'repeater'])
   a  b  c  repeater
0  1  2  3         2
1  4  5  6         3
2  7  8  9         5
And I repeat every row based on df['repeater'], like df = df.loc[df.index.repeat(df['repeater'])]
So I end up with a data frame:
   a  b  c  repeater
0  1  2  3         2
0  1  2  3         2
1  4  5  6         3
1  4  5  6         3
1  4  5  6         3
2  7  8  9         5
2  7  8  9         5
2  7  8  9         5
2  7  8  9         5
2  7  8  9         5
How can I add an incremental value based on the index row? So a new column df['incremental'] with the output:
   a  b  c  repeater  incremental
0  1  2  3         2            1
0  1  2  3         2            2
1  4  5  6         3            1
1  4  5  6         3            2
1  4  5  6         3            3
2  7  8  9         5            1
2  7  8  9         5            2
2  7  8  9         5            3
2  7  8  9         5            4
2  7  8  9         5            5
Try your code with an extra groupby and cumcount:
df = df.loc[df.index.repeat(df['repeater'])]
df['incremental'] = df.groupby(df.index).cumcount() + 1
print(df)
Output:
   a  b  c  repeater  incremental
0  1  2  3         2            1
0  1  2  3         2            2
1  4  5  6         3            1
1  4  5  6         3            2
1  4  5  6         3            3
2  7  8  9         5            1
2  7  8  9         5            2
2  7  8  9         5            3
2  7  8  9         5            4
2  7  8  9         5            5
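The repeat and the counter can also be chained into a single expression with assign; a minimal variant of the same idea:
df = (df.loc[df.index.repeat(df['repeater'])]
        .assign(incremental=lambda d: d.groupby(level=0).cumcount() + 1))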
I want to split a dataframe into chunks with uneven numbers of rows, using the row index.
The code below:
groups = df.groupby((np.arange(len(df.index))/l[1]).astype(int))
works only for a uniform number of rows.
df
a b c
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
l = [2, 5, 7]
df1
1 1 1
2 2 2
df2
3 3 3
4 4 4
5 5 5
df3
6 6 6
7 7 7
df4
8 8 8
You could use a list comprehension, with a little modification to your list l first.
print(df)
   a  b  c
0  1  1  1
1  2  2  2
2  3  3  3
3  4  4  4
4  5  5  5
5  6  6  6
6  7  7  7
7  8  8  8
l = [2, 5, 7]
l_mod = [0] + l + [max(l)+1]  # max(l)+1 happens to equal len(df) here; len(df) is safer in general
list_of_dfs = [df.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod)-1)]
Output:
list_of_dfs[0]
   a  b  c
0  1  1  1
1  2  2  2
list_of_dfs[1]
   a  b  c
2  3  3  3
3  4  4  4
4  5  5  5
list_of_dfs[2]
   a  b  c
5  6  6  6
6  7  7  7
list_of_dfs[3]
   a  b  c
7  8  8  8
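An equivalent spelling pairs consecutive boundaries with zip and derives the final boundary from len(df), which is safer than max(l)+1 whenever the last chunk holds more than one row:
bounds = [0] + l + [len(df)]
list_of_dfs = [df.iloc[start:stop] for start, stop in zip(bounds, bounds[1:])]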
I think this is what you need:
df = pd.DataFrame({'a': np.arange(1, 8),
                   'b': np.arange(1, 8),
                   'c': np.arange(1, 8)})
df
   a  b  c
0  1  1  1
1  2  2  2
2  3  3  3
3  4  4  4
4  5  5  5
5  6  6  6
6  7  7  7
last_check = 0
dfs = []

for ind in [2, 5, 7]:
    dfs.append(df.loc[last_check:ind-1])
    last_check = ind
Although a list comprehension is generally more efficient than a for loop, the last_check variable is necessary if there is no pattern in your list of indices.
dfs[0]
   a  b  c
0  1  1  1
1  2  2  2
dfs[2]
   a  b  c
5  6  6  6
6  7  7  7
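Note that the loop only goes up to the last split index, so a frame with trailing rows beyond it (the question's frame has an eighth row, 8 8 8) would lose them; appending the tail after the loop covers that case:
dfs.append(df.loc[last_check:])  # trailing chunk from the last split point onward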
I think this is what you are looking for:
l = [2, 5, 7]
dfs = []
i = 0
for val in l:
    if i == 0:
        temp = df.iloc[:val]        # first chunk: rows before the first index
    else:
        temp = df.iloc[l[i-1]:val]  # later chunks: between consecutive indices
    dfs.append(temp)
    i += 1
Output:
   a  b  c
0  1  1  1
1  2  2  2
   a  b  c
2  3  3  3
3  4  4  4
4  5  5  5
   a  b  c
5  6  6  6
6  7  7  7
Another Solution:
l = [2, 5, 7]
t = np.arange(l[-1])
l.reverse()
for val in l:
    t[:val] = val  # label each row with the first split index at or above it
temp = pd.DataFrame(t)
temp = pd.concat([df, temp], axis=1)
for u, v in temp.groupby(0):
    print(v)
Output:
   a  b  c  0
0  1  1  1  2
1  2  2  2  2
   a  b  c  0
2  3  3  3  5
3  4  4  4  5
4  5  5  5  5
   a  b  c  0
5  6  6  6  7
6  7  7  7  7
You can create an array to use for indexing via NumPy:
import pandas as pd, numpy as np
df = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list('abc'))
L = [2, 5, 7]
idx = np.cumsum(np.in1d(np.arange(len(df.index)), L))
for _, chunk in df.groupby(idx):
print(chunk, '\n')
   a  b  c
0  0  1  2
1  3  4  5

    a   b   c
2   6   7   8
3   9  10  11
4  12  13  14

    a   b   c
5  15  16  17
6  18  19  20

    a   b   c
7  21  22  23
Instead of defining a new variable for each dataframe, you can use a dictionary:
d = dict(tuple(df.groupby(idx)))
print(d[1])  # the second group
    a   b   c
2   6   7   8
3   9  10  11
4  12  13  14
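As a side note, np.in1d was deprecated in NumPy 2.0 in favor of np.isin, so on newer versions the indexing line can be written as:
idx = np.cumsum(np.isin(np.arange(len(df.index)), L))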