My DataFrame urm has shape (96438, 3):
   user_id  anime_id  user_rating
0        1        20     7.808497
1        3        20     8.000000
2        5        20     6.000000
3        6        20     7.808497
4       10        20     7.808497
I'm trying to build a user-item rating matrix:
X = urm[["user_id", "anime_id"]].to_numpy()  # as_matrix() was removed in pandas 1.0
y = urm["user_rating"].values
n_u = len(urm["user_id"].unique())
n_m = len(urm["anime_id"].unique())
R = np.zeros((n_u, n_m))
for idx, row in enumerate(X):
    R[row[0]-1, row[1]-1] = y[idx]
If the code succeeded, the matrix would have user_id as the index, anime_id as the columns and the rating as the values (I got such a matrix from pivot_table and filled NaN with 0).
This works in some tutorials, but here I get an error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-278-0e06bd0f3133> in <module>()
15 R = np.zeros((n_u, n_m))
16 for idx, row in enumerate(X):
---> 17 R[row[0]-1, row[1]-1] = y[idx]
IndexError: index 5276 is out of bounds for axis 1 with size 5143
I tried the second suggestion of dennlinger and it worked for me.
This was the code I wrote:
def id_to_index(df):
    """
    maps the values to the lowest consecutive values
    :param df: pandas Dataframe with columns user, item, rating
    :return: pandas Dataframe with the extra columns index_item and index_user
    """
    index_item = np.arange(0, len(df.item.unique()))
    index_user = np.arange(0, len(df.user.unique()))
    df_item_index = pd.DataFrame(df.item.unique(), columns=["item"])
    df_item_index["new_index"] = index_item
    df_user_index = pd.DataFrame(df.user.unique(), columns=["user"])
    df_user_index["new_index"] = index_user
    df["index_item"] = df["item"].map(df_item_index.set_index('item')["new_index"]).fillna(0)
    df["index_user"] = df["user"].map(df_user_index.set_index('user')["new_index"]).fillna(0)
    return df
I am assuming you have non-consecutive user IDs (or movie IDs), which means that there exist indices that either have
no rating, or
no movie
In your case, you are setting up your matrix dimensions on the assumption that the IDs are consecutive (since you define the dimensions by the number of unique values), so any ID larger than the number of unique values falls out of bounds. In your traceback, for example, an anime_id of 5277 maps to column index 5276, but the matrix only has 5143 columns.
In that case, you have two options:
You can define your matrix to be of size urm["user_id"].max() by urm["anime_id"].max()
Create a dictionary that maps your values to the lowest consecutive values.
The disadvantage of the first approach is obviously that it requires you to store a bigger matrix. Also, you can use scipy.sparse to create a matrix from the data format you have (commonly referred to as the coordinate matrix format).
Potentially, you can do something like this:
from scipy import sparse
# scipy expects the data in (value_column, (x, y))
mat = sparse.coo_matrix((urm["user_rating"], (urm["user_id"], urm["anime_id"])))
# if you want it as a dense matrix
dense_mat = mat.todense()
You can then also work your way to the second suggestion, as I have previously asked here
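The second suggestion (mapping IDs to the lowest consecutive values) can be sketched with pd.factorize, which returns integer codes 0..n-1 together with the unique values. This is only a minimal sketch: the small urm frame below is made-up stand-in data, and the names user_idx, item_idx, user_ids and item_ids are illustrative, not from the original post.

```python
import pandas as pd
from scipy import sparse

# Toy stand-in for `urm`; the IDs are deliberately non-consecutive.
urm = pd.DataFrame({
    "user_id": [1, 3, 5, 6, 10],
    "anime_id": [20, 20, 5276, 20, 5276],
    "user_rating": [7.8, 8.0, 6.0, 7.8, 9.0],
})

# factorize returns (codes, uniques); codes run from 0 to n-1
# in order of first appearance, so they are safe matrix indices.
user_idx, user_ids = pd.factorize(urm["user_id"])
item_idx, item_ids = pd.factorize(urm["anime_id"])

R = sparse.coo_matrix(
    (urm["user_rating"].to_numpy(), (user_idx, item_idx)),
    shape=(len(user_ids), len(item_ids)),
)
dense_R = R.toarray()  # only sensible for small matrices
```

Row i of dense_R then belongs to user_ids[i] and column j to item_ids[j], so the original IDs can still be recovered.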
I want to read a text file with values of matrix. Let's say you have got a .txt file looking like this:
0 0 4.0
0 1 5.2
0 2 2.1
1 0 2.1
1 1 2.9
1 2 3.1
Here, the first column gives the indices of the matrix on the x-axis and the second column gives the indices on the y-axis. The third column is the value at this position in the matrix. When a value is missing, it is taken to be zero.
I am well aware of the fact, that data formats like the .mtx format exist, but I would like to create a scipy sparse matrix or numpy array from this txt file alone instead of adjusting it to the .mtx file format. Is there a Python function out there, which does this for me, which I am missing?
import numpy
with open('filename.txt','r') as f:
    lines = f.readlines()
data = [i.split(' ') for i in lines]
z = list(zip(*data))
row_indices = list(map(int,z[0]))
column_indices = list(map(int,z[1]))
values = list(map(float,z[2]))
m = max(row_indices)+1
n = max(column_indices)+1
p = max([m,n])
A = numpy.zeros((p,p))
A[row_indices,column_indices]=values
print(A)
The code above builds a square p-by-p matrix. If you instead want a rectangular matrix whose number of rows comes from the maximum of column 1 and whose number of columns comes from the maximum of column 2, remove p = max([m,n]) and replace A = numpy.zeros((p,p)) with A = numpy.zeros((m,n)).
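Since the question also asks for a scipy sparse matrix, here is a minimal sketch that feeds the three columns straight into coo_matrix. It assumes the file is whitespace-separated with exactly three columns; io.StringIO stands in for the real file so the example is self-contained.

```python
import io
import numpy as np
from scipy import sparse

# Stand-in for open('filename.txt'); same content as the example file.
text = io.StringIO("0 0 4.0\n0 1 5.2\n0 2 2.1\n1 0 2.1\n1 1 2.9\n1 2 3.1\n")

# unpack=True returns one array per column
rows, cols, vals = np.loadtxt(text, unpack=True)

# Without an explicit shape, coo_matrix infers it from the largest indices.
mat = sparse.coo_matrix((vals, (rows.astype(int), cols.astype(int))))
A = mat.toarray()  # dense copy, if a plain numpy array is needed
```

Positions that do not appear in the file are simply not stored, which matches the "missing means zero" convention.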
Starting from the array (a) sorted on the first column (major) and second (minor) as in your example, you can reshape:
# a = np.loadtxt('filename')
x = len(np.unique(a[:,0]))
y = len(np.unique(a[:,1]))
a[:,2].reshape(x,y).T
Output:
array([[4. , 2.1],
       [5.2, 2.9],
       [2.1, 3.1]])
import pandas as pd
import numpy as np
data_dir = 'data_r14.csv'
data = pd.read_csv(data_dir)
# print(data)
signals = data['signal']
value_counts = signals.value_counts()
buy_count = value_counts[1]
signals_code = [1, 2]
buy_sell_rows = data.loc[data['signal'].isin(signals_code)]
data_without_signals = data[~data['signal'].isin(signals_code)]
random_0_indexes = np.random.choice(data_without_signals.index.values, buy_count)
value_counts2 = data_without_signals['signal'].value_counts()
# print(value_counts2)
for index in random_0_indexes:
    row = data.loc[index, :]
    # df = row.to_frame()
    print(row)
    buy_sell_rows.append(row)
    # print(buy_sell_rows)
    # print(signals.loc[index, :])
# print(random_0_rows)
print(buy_sell_rows)
# print(buy_sell_rows['signal'].value_counts())
I have a dataframe with a column named signal whose values are 0, 1, or 2. The classes are very unbalanced: I have only 1984 rows with a non-zero value and over 20000 rows with a zero value, so I want to balance them by keeping an equal number of rows for each value.
I created a new dataframe containing only the zero rows, called data_without_signals, and drew a random list of indexes from it. I then loop over those indexes and try to append each row to another dataframe, buy_sell_rows, which holds only the non-zero rows. The issue is that the row is not being appended.
As said in my comment, I think your general approach could be simplified by randomly sampling the different signals:
# my test signal of 0s, 1s and 2s
test = pd.DataFrame({"data" : [0,0,0,1,1,1,1,1,1,1,2,2,2,2,2,2]})
# get the lowest size of any group, which is the size all groups should be reduced to
min_size = test.groupby("data")["data"].count().min()
# sample
output = (test
    .groupby(["data"])
    .agg(sample = ("data", lambda x : x.sample(min_size).to_list()))
    .explode("sample")
    .reset_index(drop=True)
)
and the output for this test is:
  sample
0      0
1      0
2      0
3      1
4      1
5      1
6      2
7      2
8      2
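If the other columns of each row should be kept as well, the same idea can be sketched with DataFrameGroupBy.sample, which is available in pandas 1.1 and later. The extra column x and the fixed random_state are assumptions added only to make the example reproducible.

```python
import pandas as pd

# Same test signal as above, plus a made-up payload column `x`.
test = pd.DataFrame({
    "data": [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "x": range(16),
})

# Size of the smallest class; every class is cut down to this size.
min_size = test["data"].value_counts().min()

# Draw min_size whole rows from each class.
balanced = test.groupby("data").sample(n=min_size, random_state=0)
```

Because whole rows are sampled, balanced keeps the original index labels, which makes it easy to trace rows back to the source frame.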
I have a CSV that contains 12 columns and 4 rows of data (see image).
I would like to divide each of those values by the area of its column, for which I have created an array (see image), then multiply by 100 to get a percentage, and store the results in new columns.
So, for example, A2, A3 and A4 will all be divided by 52,600 and then multiplied by 100.
My current df is shown in the dataframe image.
I interpreted your request for a new column to be a new column for each Sub_* in your spreadsheet, since there were 12 values in your numpy array.
Code edit: I see you wanted to do the math to the 'Baseline' column as well. So I step through each column except the first (which is "Label" and at index 0)
import numpy as np
import pandas as pd
df = pd.read_excel(r"d:\stack67477476.xlsx")  # raw string keeps the backslash literal
area_arr = np.array([[52.6, 14.966, 7.702, 4.169, 3.71, 5.648, 6.785, 1.867, 5.268, 4.989, 1.659, 6.538]])
for ii, col in enumerate(df.columns):
    if ii == 0:
        continue
    df[col + "_Area"] = round(df[col] / area_arr[0][ii - 1] * 100, 2)
This is vectorized in one dimension (the 4 rows dimension) but loops through the 12 columns dimension. The output is as follows (don't quote me on this, I may have typed your inputs incorrectly):
df
Label Baseline Sub_A Sub_B Sub_C Sub_D Sub_E Sub_F Sub_G Sub_H Sub_I ... Sub_A_Area Sub_B_Area Sub_C_Area Sub_D_Area Sub_E_Area Sub_F_Area Sub_G_Area Sub_H_Area Sub_I_Area Sub_J_Area Sub_K_Area
0 0 0 15535 5128 8847 10784 5679 20481 8398 10012 5162 ... 103801.95 66580.11 212209.16 290673.85 100548.87 301857.04 449812.53 190053.15 103467.63 275527.43 380177.42
1 1 159506 149454 157456 155680 154327 154671 146863 150761 150446 155335 ... 998623.55 2044352.12 3734228.83 4159757.41 2738509.21 2164524.69 8075040.17 2855846.62 3113549.81 9387040.39 1963949.22
2 2 129087 111918 121515 122066 119557 123813 114746 123140 122156 125480 ... 747815.05 1577707.09 2927944.35 3222560.65 2192156.52 1691171.70 6595607.93 2318830.68 2515133.29 7608679.93 1653533.19
3 3 137562 102318 114509 124641 127442 130324 123331 130392 130715 134528 ... 683669.65 1486743.70 2989709.76 3435094.34 2307436.26 1817700.81 6984038.56 2481302.20 2696492.28 8123206.75 1881890.49
4 4 35901 26488 30836 33756 34549 34000 33269 34071 34151 35149 ... 176987.84 400363.54 809690.57 931239.89 601983.00 490331.61 1824906.27 648272.59 704529.97 2146473.78 531691.65
[5 rows x 25 columns]
Note that it's unclear why your numpy array is 2D; presumably there is a reason for that elsewhere in your code. It would be clearer to drop one set of brackets:
area_arr = np.array([52.6, 14.966, 7.702, 4.169, 3.71, 5.648, 6.785, 1.867, 5.268, 4.989, 1.659, 6.538])
And simplify the divisor to just:
area_arr[ii - 1] # not area_arr[0][ii - 1]
or for that matter, a simple list would be ok, since numpy isn't needed here.
Apologies if we have miscommunicated on commas and decimal points, but the code still works if you change the numbers.
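The loop over columns can also be replaced by a single broadcasted division. The sketch below uses a tiny made-up frame with only two value columns; the names area, value_cols and pct are illustrative, and it assumes the areas are listed in the same order as the value columns.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the spreadsheet: a label column and two value columns.
df = pd.DataFrame({
    "Label": [0, 1],
    "Baseline": [526.0, 1052.0],
    "Sub_A": [149.66, 299.32],
})
area = np.array([52.6, 14.966])  # one area per value column, in column order

value_cols = df.columns[1:]  # everything except "Label"

# div(..., axis="columns") lines the array up with the columns,
# so each column is divided by its own area.
pct = (df[value_cols].div(area, axis="columns") * 100).round(2)
df = df.join(pct.add_suffix("_Area"))
```

This yields the same *_Area columns as the loop, without indexing into the area array by hand.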
I'm making my way around GroupBy, but I still need some help. Let's say that I've a DataFrame with columns Group, giving objects group number, some parameter R and spherical coordinates RA and Dec. Here is a mock DataFrame:
df = pd.DataFrame({
'R' : (-21.0,-21.5,-22.1,-23.7,-23.8,-20.4,-21.8,-19.3,-22.5,-24.7,-19.9),
'RA': (154.362789,154.409301,154.419191,154.474165,154.424842,162.568516,8.355454,8.346812,8.728223,8.759622,8.799796),
'Dec': (-0.495605,-0.453085,-0.481657,-0.614827,-0.584243,8.214719,8.355454,8.346812,8.728223,8.759622,8.799796),
'Group': (1,1,1,1,1,2,2,2,2,2,2)
})
I want to build a selection containing, for each group, the "brightest" object, i.e. the one with the smallest R (or the greatest absolute value, since R is negative), and the 3 objects of the group closest to it (so I keep 4 objects in each group; we can assume that no group has fewer than 4 objects if needed).
We assume here that we have defined the following functions:
#deg to rad
def d2r(x):
    return x * np.pi / 180.0
#rad to deg
def r2d(x):
    return x * 180.0 / np.pi
#Computes separation on a sphere
def calc_sep(phi1,theta1,phi2,theta2):
    return np.arccos(np.sin(theta1)*np.sin(theta2) +
                     np.cos(theta1)*np.cos(theta2)*np.cos(phi2 - phi1) )
and that separation between two objects is given by r2d(calc_sep(RA1,Dec1,RA2,Dec2)), with RA1 as RA for the first object, and so on.
I can't figure out how to use GroupBy to achieve this...
What you can do here is build a more specific helper function that gets applied to each "sub-frame" (each group).
GroupBy is really just a facility that creates something like an iterator of (group id, DataFrame) pairs, and a function is applied to each of these when you call .groupby().apply. (That glosses over a lot of details; see here for some details on internals if you're interested.)
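To make the "(group id, DataFrame) pairs" picture concrete, here is a small sketch that iterates over a GroupBy object directly. The five-row frame is made-up data and the brightest dictionary is only an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "Group": [1, 1, 2, 2, 2],
    "R": [-21.0, -23.8, -20.4, -24.7, -19.9],
})

brightest = {}
for group_id, sub_df in df.groupby("Group"):
    # sub_df keeps the original index labels, so idxmin returns
    # a label that can be used with df.loc later on.
    brightest[group_id] = sub_df["R"].idxmin()
```

Each sub_df is an ordinary DataFrame, which is why a helper written for a single frame can be passed to .apply unchanged.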
So after defining your three NumPy-based functions, also define:
def sep_df(df, keep=3):
    # idxmin returns the index label (argmin returns a position in recent pandas)
    min_r = df.loc[df.R.idxmin()]
    RA1, Dec1 = min_r.RA, min_r.Dec
    sep = r2d(calc_sep(RA1,Dec1,df['RA'], df['Dec']))
    idx = sep.nsmallest(keep+1).index
    return df.loc[idx]
Then just apply and you get a MultiIndex DataFrame where the first index level is the group.
print(df.groupby('Group').apply(sep_df))
Dec Group R RA
Group
1 3 -0.61483 1 -23.7 154.47416
2 -0.48166 1 -22.1 154.41919
0 -0.49561 1 -21.0 154.36279
4 -0.58424 1 -23.8 154.42484
2 8 8.72822 2 -22.5 8.72822
10 8.79980 2 -19.9 8.79980
6 8.35545 2 -21.8 8.35545
9 8.75962 2 -24.7 8.75962
With some comments interspersed:
def sep_df(df, keep=3):
    # Applied to each sub-Dataframe (this is what GroupBy does under the hood)
    # Get RA and Dec values at minimum R
    # (idxmin returns the index label; argmin returns a position in recent pandas)
    min_r = df.loc[df.R.idxmin()] # Series - row at which R is minimum
    RA1, Dec1 = min_r.RA, min_r.Dec # Relevant 2 scalars within this row
    # Calculate separation for each pair including minimum R row
    # The result is a series of separations, same length as `df`
    sep = r2d(calc_sep(RA1,Dec1,df['RA'], df['Dec']))
    # Get index values of `keep` (default 3) smallest results
    # Retain `keep+1` values because one will be the minimum R
    # row where separation=0
    idx = sep.nsmallest(keep+1).index
    # Restrict the result to those 3 index labels + your minimum R
    return df.loc[idx]
For speed, consider passing sort=False to GroupBy if the result still works for you.
I want to build a selection containing for each group the "brightest" object...and the 3 closest objects of the group
step 1:
create a dataframe for the brightest object in each group
maxR = df.sort_values('R').groupby('Group')[['Group', 'Dec', 'RA']].head(1)
step 2:
merge the two frames on Group & calculate the separation
merged = df.merge(maxR, on = 'Group', suffixes=['', '_max'])
merged['sep'] = merged.apply(
    lambda x: r2d(calc_sep(x.RA, x.Dec, x.RA_max, x.Dec_max)),
    axis=1
)
step 3:
order the data frame, group by 'Group', (optional) discard intermediate fields & take the first 4 rows from each group
finaldf = merged.sort_values(['Group', 'sep'], ascending=[1,1]
).groupby('Group')[df.columns].head(4)
Produces the following data frame with your sample data:
Dec Group R RA
4 -0.584243 1 -23.8 154.424842
3 -0.614827 1 -23.7 154.474165
2 -0.481657 1 -22.1 154.419191
0 -0.495605 1 -21.0 154.362789
9 8.759622 2 -24.7 8.759622
8 8.728223 2 -22.5 8.728223
10 8.799796 2 -19.9 8.799796
6 8.355454 2 -21.8 8.355454
I have a couple of files 100 MB each. The format for those files looks like this:
0 1 2 5 8 67 9 122
1 4 5 2 5 8
0 2 1 5 6
.....
(note the actual file does not have the alignment spaces added in, only one space separates each element, added alignment for aesthetic effect)
The first element in each row is its binary classification, and the rest of the row are the indices of the features whose value is 1. For instance, the third row says that the row's second, first, fifth and sixth features are 1, and the rest are zeros.
I tried to read each line from each file, and use sparse.coo_matrix create a sparse matrix like this:
for train in train_files:
    with open(train) as f:
        row = []
        col = []
        for index, line in enumerate(f):
            record = line.rstrip().split(' ')
            row = row+[index]*(len(record)-4)
            col = col+record[4:]
        row = np.array(row)
        col = np.array(col)
        data = np.array([1]*len(row))
        mtx = sparse.coo_matrix((data, (row, col)), shape=(n_row, max_feature))
        mmwrite(train+'trans',mtx)
but this took forever to finish. I started reading the data at night and let the computer run after I went to sleep, and when I woke up, it still hadn't finished the first file!
What are the better ways to process this kind of data?
I think this would be a bit faster than your method because it does not read the file line by line. You can try this code on a small portion of one file and compare it with your code.
This code also requires knowing the number of features in advance. If it is not known, one additional line of code, commented out below, can compute it.
import pandas as pd
from scipy.sparse import lil_matrix
from functools import partial
def writeMx(result, row):
    # zero-based matrix requires the feature number minus 1;
    # cast to int because the NaN padding makes the values float
    col_ind = row.dropna().values.astype(int) - 1
    # Assign values without duplicating row index and values
    result[row.name, col_ind] = 1
def fileToMx(f):
    # number of features
    col_n = 136
    df = pd.read_csv(f, names=list(range(0,col_n+2)),sep=' ')
    # This is the label of the binary classification
    label = df.pop(0)
    # Or get the feature number by the line below
    # But it would not be the same across different files
    # col_n = df.max().max()
    # Number of row
    row_n = len(label)
    # Generate feature matrix for one file
    result = lil_matrix((row_n, col_n))
    # Save features in matrix
    # DataFrame.apply() is usually faster than normal looping;
    # axis=1 passes each row to writeMx, so row.name is the row index
    df.apply(partial(writeMx, result), axis=1)
    return result
for train in train_files:
    # result is the sparse matrix you can further save or use
    result = fileToMx(train)
    print(result.shape, result.nnz)
    # The shape of matrix and number of nonzero values
    # ((420, 136), 15)
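Another option, sketched here under a few assumptions, is to build a CSR matrix directly while streaming over the lines. Part of the slowness in the question likely comes from row = row + [...], which copies the whole list on every line; appending to flat lists avoids that. The sketch assumes feature indices are 1-based (as the question's description suggests), that the number of features is known in advance, and it uses a hard-coded list of lines in place of a real file.

```python
import numpy as np
from scipy import sparse

# Stand-in for iterating over an open file, using the rows from the question.
lines = ["0 1 2 5 8 67 9 122", "1 4 5 2 5 8", "0 2 1 5 6"]

labels, indices, indptr = [], [], [0]
for line in lines:
    parts = line.split()
    labels.append(int(parts[0]))
    # The set drops feature indices repeated within a row;
    # subtracting 1 converts 1-based indices to 0-based columns.
    feats = sorted({int(p) - 1 for p in parts[1:]})
    indices.extend(feats)         # in-place append, no list copying
    indptr.append(len(indices))   # marks where this row ends

n_features = 136  # assumed known in advance, as in the answer above
data = np.ones(len(indices), dtype=np.int8)
X = sparse.csr_matrix((data, indices, indptr), shape=(len(lines), n_features))
y = np.array(labels)
```

Here indptr[i]:indptr[i+1] delimits the column indices of row i, which is exactly the layout CSR stores internally, so no conversion step is needed afterwards.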