I have a similar question to:
Convert COO to CSR format in c++
but in Python. I don't want to use SciPy.
COO format:
row_index col_index value
1 1 1
1 2 -1
1 3 -3
2 1 -2
2 2 5
3 3 4
3 4 6
3 5 4
4 1 -4
4 3 2
4 4 7
5 2 8
5 5 -5
Desired output:
row_index col_index value
0 1 1
2 2 -1
4 3 -3
7 1 -2
8 2 5
3 4
4 6
5 4
1 -4
3 2
4 7
2 8
5 -5
I figured out how to build the CSR row-pointer array in Python:
nnz = len(value)
rows = max(row_index) + 1
csr_row = [0] * (rows + 1)
# count the number of entries in each row
for i in range(nnz):
    csr_row[row_index[i] + 1] += 1
# prefix-sum the counts to get the row pointers
for i in range(rows):
    csr_row[i + 1] += csr_row[i]
    print("after: ", csr_row)  # show the running prefix sum
Output:
after: [0, 2, 1, 3]
after: [0, 2, 3, 3]
after: [0, 2, 3, 6]
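For completeness, a minimal sketch of the full conversion, which also reorders the column indices and values by row; row_index, col_index, and value are assumed to be the parallel COO lists from the question:
nnz = len(value)
rows = max(row_index) + 1

# 1) count entries per row, then prefix-sum the counts into row pointers
csr_row = [0] * (rows + 1)
for r in row_index:
    csr_row[r + 1] += 1
for i in range(rows):
    csr_row[i + 1] += csr_row[i]

# 2) scatter the column indices and values into row order
csr_col = [0] * nnz
csr_val = [0] * nnz
next_slot = csr_row[:]  # next free position within each row
for i in range(nnz):
    dest = next_slot[row_index[i]]
    csr_col[dest] = col_index[i]
    csr_val[dest] = value[i]
    next_slot[row_index[i]] += 1

print(csr_row, csr_col, csr_val)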
Let's consider the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame([1, 2, 3, 4, 3, 2, 5, 6, 4, 2, 1, 6])
I want to do the following: if the i-th element of the dataframe is bigger than the mean of the next two elements, assign 1 to it; otherwise assign -1.
My solution
An obvious solution is the following:
df_copy = df.copy()
for i in range(len(df) - 2):
    # compare element i with the mean of the next two elements
    if (df.iloc[i] > np.mean(df.iloc[(i+1):(i+3)]))[0]:
        df_copy.iloc[i] = 1
    else:
        df_copy.iloc[i] = -1
However, I find it a little cumbersome, and I'm wondering if there is a loop-free solution to this kind of problem.
Desired output
0
0 -1
1 -1
2 -1
3 1
4 1
5 -1
6 -1
7 1
8 1
9 1
10 1
11 6
You can use rolling.mean and shift: rolling(2).mean() takes the mean of each element and the one before it, and shift(-2) moves that result up two rows, so the mean of elements i+1 and i+2 lands on row i:
df['out'] = np.where(df[0].gt(df[0].rolling(2).mean().shift(-2)), 1, -1)
output:
0 out
0 1 -1
1 2 -1
2 3 -1
3 4 1
4 3 -1
5 2 -1
6 5 -1
7 6 1
8 4 1
9 2 -1
10 1 -1
11 6 -1
To keep the last items unchanged (rows where the shifted mean is NaN fall back to the original value):
m = df[0].rolling(2).mean().shift(-2)
df['out'] = np.where(df[0].gt(m), 1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])
output:
0 out
0 1 -1
1 2 -1
2 3 -1
3 4 1
4 3 -1
5 2 -1
6 5 -1
7 6 1
8 4 1
9 2 -1
10 1 1
11 6 6
I have this problem where I get console input that is not an array/matrix; it's just a bunch of numbers per line, separated by spaces. How do I turn them into an actual array?
Note about the input: it is not a txt or csv file; the numbers are typed directly into the terminal, separated by spaces.
My given input is like this:
4 3 2 1 3 4 2 2 3 3 2 1 2 3
4 4 1 3 2 1 4 2 4 1 1 1 4 3
4 4 4 1 2 2 3 4 2 1 1 2 1 1
1 3 2 2 2 3 2 3 3 4 3 2 4 1
1 3 3 2 2 2 4 4 3 2 1 4 4 3
2 1 4 3 2 2 2 4 3 1 2 3 1 1
4 1 3 4 2 4 3 3 2 4 2 2 1 1
1 1 4 3 4 1 3 3 2 2 1 1 3 1
4 2 1 1 3 3 1 2 3 2 2 1 2 3
And for now I'm doing this:
string_data = input()
arr = [[int(num) for num in sublist.split()] for sublist in string_data.split('\n') if len(sublist)>0]
print(arr)
This solution kind of works, but only for the first line, so the output from the code above will be:
[4, 3, 2, 1, 3, 4, 2, 2, 3, 3, 2, 1, 2, 3]
I need to find a way to merge all the lines into one full array without duplicating the code I wrote for each line. My main problem is turning the input into an array so that I can add up every occurrence of the number 2; if it appears 5 times, the program needs to read that and return the number 10.
Use np.fromstring on the whole input:
import sys
import numpy as np

# read every line from the terminal at once (input() only returns the first line)
string_data = sys.stdin.read()
arr = np.fromstring(string_data, dtype=int, sep=" ")  # flat array of all the numbers
print(arr)
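Since np.fromstring's text mode is deprecated in recent NumPy versions, here is a sketch without it that also covers the stated goal of summing every occurrence of the number 2 (assuming the numbers arrive on stdin exactly as shown):
import sys
import numpy as np

# parse every whitespace-separated token from stdin into one flat int array
arr = np.array(sys.stdin.read().split(), dtype=int)

# total contributed by the 2s: count the occurrences and multiply by 2
total = (arr == 2).sum() * 2
print(total)  # prints 10 when 2 appears five times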
Let's say I have a DF with 5 columns and I want to make a unique 'key' for each row.
a b c d e
1 1 2 3 4 5
2 1 2 3 4 6
3 1 2 3 4 7
4 1 2 2 5 6
5 2 3 4 5 6
6 2 3 4 5 6
7 3 4 5 6 7
I'd like to create a 'key' column as follows:
a b c d e key
1 1 2 3 4 5 12345
2 1 2 3 4 6 12346
3 1 2 3 4 7 12347
4 1 2 2 5 6 12256
5 2 3 4 5 6 23456
6 2 3 4 5 6 23456
7 3 4 5 6 7 34567
Now the problem with this, of course, is that rows 5 & 6 are duplicates.
I'd like to be able to create unique keys like so:
a b c d e key
1 1 2 3 4 5 12345_1
2 1 2 3 4 6 12346_1
3 1 2 3 4 7 12347_1
4 1 2 2 5 6 12256_1
5 2 3 4 5 6 23456_1
6 2 3 4 5 6 23456_2
7 3 4 5 6 7 34567_1
Not sure how to do this or if this is the best method - appreciate any help.
Thanks
Edit: Columns will be mostly strings, not numeric.
One way is to hash the tuple of each row:
In [11]: df.apply(lambda x: hash(tuple(x)), axis=1)
Out[11]:
1 -2898633648302616629
2 -2898619338595901633
3 -2898621714079554433
4 -9151203046966584651
5 1657626630271466437
6 1657626630271466437
7 3771657657075408722
dtype: int64
In [12]: df['key'] = df.apply(lambda x: hash(tuple(x)), axis=1)
In [13]: df['key'].astype(str) + '_' + (df.groupby('key').cumcount() + 1).astype(str)
Out[13]:
1 -2898633648302616629_1
2 -2898619338595901633_1
3 -2898621714079554433_1
4 -9151203046966584651_1
5 1657626630271466437_1
6 1657626630271466437_2
7 3771657657075408722_1
dtype: object
Note: Generally you don't need to be doing this (it's unclear why you'd want to!).
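One caveat worth adding: for string values (which the edit says these columns will mostly be), Python's built-in hash is randomized per interpreter process, so hash-based keys are not stable across runs. pandas ships a deterministic row hasher that avoids this:
# stable 64-bit row hashes, independent of PYTHONHASHSEED
df['key'] = pd.util.hash_pandas_object(df, index=False).astype(str)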
Try this:
df['key'] = df.apply(lambda x: '-'.join(map(str, x)), axis=1)  # join the row values into one string
m = ~df['key'].duplicated()
s = df.groupby(m.cumsum()).cumcount().astype(str)  # per-key counter, matching the output below
df['key'] = df['key'] + '_' + s
print(df)
O/P:
a b c d e key
0 1 2 3 4 5 1-2-3-4-5_0
1 1 2 3 4 6 1-2-3-4-6_0
2 1 2 3 4 7 1-2-3-4-7_0
3 1 2 2 5 6 1-2-2-5-6_0
4 2 3 4 5 6 2-3-4-5-6_0
5 2 3 4 5 6 2-3-4-5-6_1
6 3 4 5 6 7 3-4-5-6-7_0
7 1 2 3 4 5 1-2-3-4-5_1
Another, much simpler way:
df['key'] = df['key'] + '_' + df.groupby('key').cumcount().astype(str)
Explanation:
First create the unique id by joining the row values.
Then create the sequence s using duplicated and cumsum, so the count restarts whenever a new key appears.
Finally concatenate the key and the sequence s.
Maybe you can do something like the following:
import uuid
df['uuid'] = [uuid.uuid4() for _ in range(len(df))]
Another approach would be to use np.random.choice(range(10000,99999), len(df), replace=False) to generate unique random numbers without replacement for each row in your df:
df = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e'],
                  data=[[1, 2, 3, 4, 5], [1, 2, 3, 4, 6], [1, 2, 3, 4, 7],
                        [1, 2, 2, 5, 6], [2, 3, 4, 5, 6], [2, 3, 4, 5, 6],
                        [3, 4, 5, 6, 7]])
df['key'] = np.random.choice(range(10000,99999), len(df), replace=False)
df
a b c d e key
0 1 2 3 4 5 10560
1 1 2 3 4 6 79547
2 1 2 3 4 7 24762
3 1 2 2 5 6 95221
4 2 3 4 5 6 79460
5 2 3 4 5 6 62820
6 3 4 5 6 7 82964
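For reference, a compact vectorized variant that combines the join-based key with the groupby cumcount suffix; a sketch, assuming every column can be cast to a string:
# build the joined key, then append a per-key counter to break ties
key = df.astype(str).agg('-'.join, axis=1)
df['key'] = key + '_' + (df.groupby(key).cumcount() + 1).astype(str)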
Suppose we have a numpy 2d array (or a pandas DataFrame) with arbitrary numbers of rows and columns.
Is there a quick way to inspect all elements and clip any element that exceeds a pre-specified max value to that max, in either a numpy ndarray or a pandas DataFrame, whichever is simpler?
pandas - use DataFrame.clip_upper:
np.random.seed(2018)
df = pd.DataFrame(np.random.randint(10, size=(5,5)))
print (df)
0 1 2 3 4
0 6 2 9 5 4
1 6 9 9 7 9
2 6 6 1 0 6
3 5 6 7 0 7
4 8 7 9 4 8
print (df.clip_upper(5))
0 1 2 3 4
0 5 2 5 5 4
1 5 5 5 5 5
2 5 5 1 0 5
3 5 5 5 0 5
4 5 5 5 4 5
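Note that DataFrame.clip_upper was deprecated in pandas 0.24 and removed in 1.0; on modern pandas the equivalent call is clip with only an upper bound:
print(df.clip(upper=5))  # same result as clip_upper(5) on current pandas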
Numpy - use numpy.clip:
np.random.seed(2018)
arr = np.random.randint(10, size=(5,5))
print (arr)
[[6 2 9 5 4]
[6 9 9 7 9]
[6 6 1 0 6]
[5 6 7 0 7]
[8 7 9 4 8]]
print (np.clip(arr, arr.min(), 5))
[[5 2 5 5 4]
[5 5 5 5 5]
[5 5 1 0 5]
[5 5 5 0 5]
[5 5 5 4 5]]
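A side note on the NumPy call: np.clip accepts None for an unused bound, so there is no need to pass arr.min() just to leave the lower end alone:
print(np.clip(arr, None, 5))  # clip only from above; same output as above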
I have a two-dimensional numpy array:
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
How would I go about converting this into a pandas DataFrame that has the x coordinate, y coordinate, and the corresponding array value at that index, like this:
x y val
0 0 1
0 1 4
0 2 7
1 0 2
1 1 5
1 2 8
...
With stack and reset_index:
df = pd.DataFrame(arr).stack().rename_axis(['y', 'x']).reset_index(name='val')
df
Out:
y x val
0 0 0 1
1 0 1 2
2 0 2 3
3 1 0 4
4 1 1 5
5 1 2 6
6 2 0 7
7 2 1 8
8 2 2 9
If ordering is important:
df.sort_values(['x', 'y'])[['x', 'y', 'val']].reset_index(drop=True)
Out:
x y val
0 0 0 1
1 0 1 4
2 0 2 7
3 1 0 2
4 1 1 5
5 1 2 8
6 2 0 3
7 2 1 6
8 2 2 9
Here's a NumPy method -
>>> arr
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
>>> shp = arr.shape
>>> r,c = np.indices(shp)
>>> pd.DataFrame(np.c_[r.ravel(), c.ravel(), arr.ravel('F')],
...               columns=['x', 'y', 'val'])
x y val
0 0 0 1
1 0 1 4
2 0 2 7
3 1 0 2
4 1 1 5
5 1 2 8
6 2 0 3
7 2 1 6
8 2 2 9
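A note on the convention here: in the desired output x is the column index and y is the row index, which is why the values are raveled in Fortran order above. A small sketch that makes the convention explicit (same toy array, hypothetical variable names):
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
rows, cols = np.indices(arr.shape)  # row and column index of every element

# x = column index, y = row index, val = element value
df = pd.DataFrame({'x': cols.ravel(), 'y': rows.ravel(), 'val': arr.ravel()})
df = df.sort_values(['x', 'y']).reset_index(drop=True)
print(df)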