I want to average the third column onwards - python

I hope you are well. I am new to Python. I am trying to add up certain columns, but not all of them, and I need your help.
W = [[77432664, 6, 2, 4, 3, 4, 3],
     [6233234, 7, 3, 2, 5, 3, 1],
     [3412455221, 8, 3, 2, 4, 5, 5]]
rows = len(W)
columns = len(W[0])
for i in range(rows):
    T = sum(W[i])
    W[i].append(T)

I assume by "add" you mean "sum" and not "insert". If so, then you can use what is called a slice:
for row in W:
    t = sum(row[1:])
    row.append(t)
row[1:] takes all but the first element of the list row. For more information on this syntax, you should google "python slice".
Also notice how I iterate over the rows of W directly, rather than using an index. This is the most common way to write a loop in Python.
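For reference, here is a complete runnable version of this approach on the data from the question; the printed totals are just the row sums computed by hand:
W = [[77432664, 6, 2, 4, 3, 4, 3],
     [6233234, 7, 3, 2, 5, 3, 1],
     [3412455221, 8, 3, 2, 4, 5, 5]]
for row in W:
    t = sum(row[1:])  # sum everything except the first (id-like) column
    row.append(t)
print([row[-1] for row in W])  # [22, 21, 27]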

You can create a subarray in Python by specifying the column range and then summing it. The code below demonstrates adding columns 2, 3, 4, 5 and 6.
W = [[77432664, 6, 2, 4, 3, 4, 3],
     [6233234, 7, 3, 2, 5, 3, 1],
     [3412455221, 8, 3, 2, 4, 5, 5]]
rows = len(W)
columns = len(W[0])
for i in range(rows):
    T = sum(W[i][2:7])  # for i=0 this retrieves the subarray [2, 4, 3, 4, 3], which sums to T=16
    W[i].append(T)
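As a quick sanity check, printing the appended totals after the loop:
print([row[-1] for row in W])  # [16, 14, 19]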

I'd suggest using the pandas sum method with axis=1, which sums across each row:
# column names to sum over
my_cols_n = [2, 3, 4, 5, 6]
# get the positional indices of the columns whose names appear in my_cols_n
my_cols = [x for x, i in enumerate(df.columns) if i in my_cols_n]
# sum across each row
df["my_sum"] = df.iloc[:, my_cols].sum(axis=1)

To add to @Code-Apprentice's answer - consider using numpy for assignments like this:
import numpy as np
W = [[77432664, 6, 2, 4, 3, 4, 3],
     [6233234, 7, 3, 2, 5, 3, 1],
     [3412455221, 8, 3, 2, 4, 5, 5]]
W=np.array(W)
>>> print(W[:, 3:].mean(axis=1))
[3.5 2.75 4. ]
As your matrix operations grow in complexity, you will quickly see the big advantages of numpy.
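If you also want to keep the means alongside the original data, one option (a sketch, not part of the original answer) is np.column_stack:
means = W[:, 3:].mean(axis=1)
W = np.column_stack([W, means])  # appends the means as a new column (the whole array becomes float)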


Understanding np.ix_

Code:
import numpy as np
ray = [1,22,33,42,51], [61,71,812,92,103], [113,121,132,143,151], [16,172,183,19,201]
ray = np.asarray(ray)
type(ray)
ray[np.ix_([-2:],[3:4])]
I'd like to use index slicing and get a subarray consisting of the last two rows and the 3rd/4th columns. My current code produces an error:
I'd also like to sum each column. What am I doing wrong? I cannot post a picture because I need at least 10 reputation points.
So you want to make a slice of an array. The most straightforward way to do it is... slicing:
slice = ray[-2:,3:]
or if you want it explicitly
slice = ray[-2:,3:5]
See it explained in Understanding slicing
But if you do want to use np.ix_ for some reason, you need
slice = ray[np.ix_([-2,-1],[3,4])]
You can't use : here, because inside np.ix_ the square brackets don't make a slice; they construct ordinary lists, so you must explicitly specify every row number and every column number you want in the result. If there are too many consecutive indices to list by hand, you may use range:
slice = ray[np.ix_(range(-2, 0),range(3, 5))]
And to sum each column:
slice.sum(0)
0 means you want to reduce the 0th dimension (rows) by summation and keep other dimensions (columns in this case).
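Putting it together as a runnable sketch on the question's array:
import numpy as np

ray = np.asarray([[1, 22, 33, 42, 51],
                  [61, 71, 812, 92, 103],
                  [113, 121, 132, 143, 151],
                  [16, 172, 183, 19, 201]])
sub = ray[-2:, 3:]  # last two rows, columns 3 and 4
print(sub)          # [[143 151]
                    #  [ 19 201]]
print(sub.sum(0))   # [162 352]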

Function to get row and column of a pandas dataset

I have a csv dataset with texts. I need to search through them. I couldn't find an easy way to search for a string in a dataset and get the row and column indexes. For example, let's say the dataset is like:
df = pd.DataFrame({"China": ['Xi','Lee','Hung'], "India": ['Roy','Rani','Jay'], "England": ['Tom','Sam','Jack']})
Now let's say I want to find the string 'rani' and know its location. Is there a simple function to do that? Or do I have to loop through everything to find it?
One vectorized (and therefore relatively scalable) solution to this is to leverage numpy.where:
import numpy as np
np.where(df == 'Rani')
This returns two arrays, corresponding to row and column indices:
(array([1]), array([1]))
You can continue to take advantage of vectorized operations, but also write a more complicated filtering function, like so:
np.where(df.applymap(lambda x: "ani" in x))
In other words, "apply to each cell the function that returns True if 'ani' is in the cell", and then conduct the same np.where filtering step.
You can use any function:
def _should_include_cell(cell_contents):
    return cell_contents.lower() == "rani" or "Xi" in cell_contents
np.where(df.applymap(_should_include_cell))
Some final notes:
applymap is slower than simple equality checking
if you need this to scale WAY up, consider using dask instead of pandas
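If you want labels rather than positions, you can map the np.where results back through the frame's index and columns (a small sketch):
rows, cols = np.where(df == 'Rani')
for r, c in zip(rows, cols):
    print(df.index[r], df.columns[c])  # prints: 1 India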
Not sure how this will scale, but it works:
df[df.eq('Rani')].dropna(axis=1, how='all').dropna()
India
1 Rani

How can I work with .iloc[] in Python to do some calculation?

I have to implement some functions to calculate special values. I read a csv file for it with pd.read_csv(). Then I used .iloc[] to find the respective row and column I need for my calculation:
V_left = data_one.iloc[0,0:4]
V_right= data_one.iloc[0,5:9]
My formula, which I want to implement, is: V_left / V_right
V is a vector of 5 parameters (values).
My question is now: how can I use the values I picked out with .iloc[] to do a calculation like my formula?
See my current code here.
You can use:
V_left.values and V_right.values to turn those pandas objects into numpy arrays, so that you can manipulate them.
However, I wouldn't use iloc in the first place, you can directly convert them:
V_left = data_one.values[0,:4]
V_right = data_one.values[0, 5:9]
Then the element-wise division V_left / V_right should be enough.
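A minimal sketch with made-up numbers, since the question's CSV isn't shown:
import pandas as pd

# hypothetical stand-in for the question's data_one
data_one = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0]])
V_left = data_one.values[0, :4]    # [1. 2. 3. 4.]
V_right = data_one.values[0, 5:9]  # [10. 20. 30. 40.]
print(V_left / V_right)            # element-wise: [0.1 0.1 0.1 0.1]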

Pulling elements in order based on first element using key array

I'm looking for a vectorized approach for the following problem:
Suppose I have two arrays, one with a bunch of non-contiguous ids in the first column and some data in the remaining columns, and a second array suggesting which datalines I need to pull:
data_array = np.array([[101,4],[102,7],[201,2],[203,9],[403,12]])
key_array = np.array([101,403,201])
The output must stay in the order given by the key_array, leading to the following:
output_array = np.array([[101,4],[403,12],[201,2]])
I can easily do this through a list comprehension:
output_array = np.array([data_array[i==data_array[:,0]][0] for i in key_array])
but this is not a vectorized solution. Using the numpy isin() is very close to working, but does not preserve the given order:
data_array[np.isin(data_array[:,0],key_array)]
#[[101 4]
# [201 2] not the order given by the key_array!
# [403 12]]
I tried making the above work with argsort(), but haven't been able to get anything working. Any help would be greatly appreciated.
We can use np.searchsorted -
s = data_array[:,0].argsort()
out = data_array[s[np.searchsorted(data_array[:,0],key_array,sorter=s)]]
If the first column of data_array is already sorted, this simplifies to a one-liner -
out = data_array[np.searchsorted(data_array[:,0],key_array)]
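Run against the question's arrays, this reproduces output_array in the order given by key_array:
import numpy as np

data_array = np.array([[101, 4], [102, 7], [201, 2], [203, 9], [403, 12]])
key_array = np.array([101, 403, 201])
s = data_array[:, 0].argsort()
out = data_array[s[np.searchsorted(data_array[:, 0], key_array, sorter=s)]]
print(out)  # [[101   4]
            #  [403  12]
            #  [201   2]]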

Python: Iterate an operation across different columns of one row for all rows of a graphlab.SFrame

There is an SFrame with columns whose elements are dicts.
import graphlab
import numpy as np
a = graphlab.SFrame({'col1': [{'oshan': 3, 'modi': 4}, {'ravi': 1, 'kishan': 5}],
                     'col2': [{'oshan': 1, 'rawat': 2}, {'hari': 3, 'kishan': 4}]})
I want to calculate the cosine distance between these two columns for each row of the SFrame. Below is the operation using a for loop.
dis = np.zeros(len(a), dtype=float)
for i in range(len(a)):
    dis[i] = graphlab.distances.cosine(a['col1'][i], a['col2'][i])
a['distance12'] = dis
This is very inefficient and would take hours if the number of rows were large. Could someone please suggest a better approach?
You can usually avoid looping over an SFrame by using the apply function. In your case, it would look like this:
a.apply(lambda row: graphlab.distances.cosine(row['col1'], row['col2']))
That should be significantly faster than looping in Python.
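To mirror the loop version and store the result, assign it back to a new column:
a['distance12'] = a.apply(lambda row: graphlab.distances.cosine(row['col1'], row['col2']))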
