I am trying to create column names for easy reference, so that I can just call a column by name from the rest of the program instead of having to know where each column is positioned. The from_ column array is coming up empty. I am new to numpy, so I am wondering how this is done. Changing the data type for columns 5 and 6 was successful, though.
def array_setter():
    import os
    import glob
    import numpy as np
    # use a raw string so the backslashes are not treated as escapes
    os.chdir(
        r'C:\Users\U2970\Documents\Arcgis\Text_files\Data_exports\North_data_folder')
    for file in glob.glob('*.TXT'):
        reader = open(file)
        headerLine = reader.readlines()
        for col in headerLine:
            valueList = col.split(",")
            data = np.array([valueList])
            from_ = np.array(data[1:, [5]], dtype=np.float32)
            # trying to assign a name to columns for easy reference
            to = np.array(data[1:, [6]], dtype=np.float32)
            if data[:, [1]] == 'C005706N':
                if data[:, [from_] < 1.0]:
                    print(data[:, [from_]])

array_setter()
If you want to index array columns by name, I would recommend turning the array into a pandas DataFrame. For example:
import pandas as pd
import numpy as np
arr = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(arr, columns=['f', 's'])
print(df['f'])
The nice part of this approach is that the arrays still maintain all their structure, but you also get all the optimized indexing/slicing capabilities of pandas. For example, if you wanted to find the elements of 'f' that correspond to elements of 's' equal to some value a, you could use loc:
a = 2
print(df.loc[df['s'] == a, 'f'])
Check out the pandas docs for the different ways to use the DataFrame object, or read Python for Data Analysis by Wes McKinney (the creator of pandas). Even though it was written for an older version of pandas, it's a great starting point and will set you in the right direction.
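If you would rather avoid the pandas dependency, numpy's structured arrays also support named columns, though the indexing is less flexible. A minimal sketch (the field names 'f' and 's' mirror the pandas example above):

```python
import numpy as np

# Structured array with two named float64 fields
arr = np.array([(1.0, 2.0), (3.0, 4.0)],
               dtype=[('f', np.float64), ('s', np.float64)])

print(arr['f'])                   # select a column by name
print(arr[arr['s'] == 2.0]['f'])  # rows where 's' equals 2.0
```

Each element of the array is a record, and boolean masks built from one field can be used to filter whole rows.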
Related
I am working on a project that uses a pandas DataFrame, where I received some values into the columns as shown below.
In there, I need to add the pos_vec column and the word_vec column elementwise to create a new column called sum_of_arrays. The arrays in the third column should also have size 2.
Eg: pos_vec word_vec sum_of_arrays
[-0.22683072, 0.32770252] [0.3655883, 0.2535131] [0.13875758,0.58121562]
Is there anyone who can help me? I'm stuck here. :(
If you convert them to np.array you can simply sum them.
import pandas as pd
import numpy as np
df = pd.DataFrame({'pos_vec':[[-0.22683072,0.32770252],[0.14382899,0.049593687],[-0.24300802,-0.0908088],[-0.2507714,-0.18816864],[0.32294357,0.4486494]],
'word_vec':[[0.3655883,0.2535131],[0.33788466,0.038143277], [-0.047320127,0.28842866],[0.14382899,0.049593687],[-0.24300802,-0.0908088]]})
If you want to use numpy
df['col_sum'] = df[['pos_vec','word_vec']].applymap(lambda x: np.array(x)).sum(1)
If you don't want to use numpy
df['col_sum'] = df.apply(lambda row: [a + b for a, b in zip(row.pos_vec, row.word_vec)], axis=1)
There may be cleaner approaches using pandas to iterate over the columns; however, this is the solution I came up with by extracting the data from the DataFrame as lists:
# Extract data as lists
pos_vec = df["pos_vec"].tolist()
word_vec = df["word_vec"].tolist()
# Create new list with desired calculation
sum_of_arrays = [[x + y for x, y in zip(l1, l2)] for l1, l2 in zip(pos_vec, word_vec)]
# Add new list to DataFrame
df["sum_of_arrays"] = sum_of_arrays
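Whichever approach you use, the result for the first row should match the expected output given in the question. A quick elementwise-sum sanity check on that row:

```python
pos_vec = [-0.22683072, 0.32770252]
word_vec = [0.3655883, 0.2535131]

# elementwise sum of the two 2-element vectors
summed = [x + y for x, y in zip(pos_vec, word_vec)]
print(summed)  # approximately [0.13875758, 0.58121562]
```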
I am trying to return the Date (the index column, set using set_index()) where the measurement at one location is twice the measurement at another. I need to show the dates and the speeds at the first location.
Here is what I have so far...
c = np.where(data['Loc1']== 2*data['Loc9'])
c
which returns a tuple... How can I get it to show the dates and wind speeds?
Slowly learning Python here.
np.where returns the positional indices of the rows meeting the criteria, so you then need to use those indices to select the rows of the DataFrame, e.g. using .iloc:
import pandas as pd
import numpy as np
# Setup:
data = pd.DataFrame({"Loc1": [4, 2, 3], "Loc9": [2, 3, 4]})
c = np.where(data['Loc1'] == 2 * data['Loc9'])
data.iloc[c[0]]
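An alternative that avoids np.where entirely is a boolean mask, which keeps the Date index intact and directly shows the dates alongside the speeds at the first location. The column names and dates below are made up for illustration:

```python
import pandas as pd

# Hypothetical data with a Date index
data = pd.DataFrame(
    {"Loc1": [4, 2, 3], "Loc9": [2, 3, 4]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
)
data.index.name = "Date"

# Dates and Loc1 speeds where Loc1 is twice Loc9
result = data.loc[data["Loc1"] == 2 * data["Loc9"], "Loc1"]
print(result)
```

The result is a Series whose index holds the matching dates and whose values are the Loc1 speeds.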
I was wondering how you could add a new column with values calculated from other columns.
Let's say a 'small' value is one that is less than 5. I want to record this information in a new column called 'small', which has value 1 if the row is small and 0 otherwise.
I know you can do that with numpy like this :
import pandas as pd
import numpy as np
rawData = pd.read_csv('data.csv', encoding = 'ISO-8859-1')
rawData['small'] = np.where(rawData['value'] < 5, '1', '0')
How would you do the same operation without using numpy?
I've read about numpy.where() but I still don't get it.
Help would be really appreciated thanks!
You can convert the boolean mask to an integer and then to a string:
rawData['small'] = (rawData['value'] < 5).astype(int).astype(str)
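A quick illustration of that one-liner on made-up data (no numpy needed):

```python
import pandas as pd

rawData = pd.DataFrame({'value': [3, 7, 4, 10]})

# boolean mask -> int (True/False become 1/0) -> str
rawData['small'] = (rawData['value'] < 5).astype(int).astype(str)
print(rawData['small'].tolist())  # ['1', '0', '1', '0']
```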
I am trying to pass values to stats.friedmanchisquare from a dataframe df, that has shape (11,17).
This is what works for me (only for three rows in this example):
df = df.to_numpy()
print(stats.friedmanchisquare(df[1, :], df[2, :], df[3, :]))
which yields
(16.714285714285694, 0.00023471398805908193)
However, the line of code is too long when I want to use all 11 rows of df.
First, I tried to pass the values in the following manner:
df = df.to_numpy()
print(stats.friedmanchisquare([df[x, :] for x in np.arange(df.shape[0])]))
but I get:
ValueError:
Less than 3 levels. Friedman test not appropriate.
Second, I also tried not converting it to matrix form and leaving it as a DataFrame (which would be ideal for me), but I guess this is not supported yet, or I am doing it wrong:
print(stats.friedmanchisquare([row for index, row in df.iterrows()]))
which also gives me the error:
ValueError:
Less than 3 levels. Friedman test not appropriate.
So, my question is: what is the correct way of passing parameters to stats.friedmanchisquare based on df? (or even using its numpy-array representation)
You can download my dataframe in csv format here and read it using:
df = pd.read_csv('df.csv', header=0, index_col=0)
Thank you for your help :)
Solution:
Based on @Ami Tavory's and @vicg's answers (please vote on them), the solution to my problem, based on the matrix representation of the data, is to add the *-operator defined here, but better explained here, as follows:
df = df.to_numpy()
print(stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])]))
And the same is true if you want to work with the original dataframe, which is what I ideally wanted:
print(stats.friedmanchisquare(*[row for index, row in df.iterrows()]))
In this manner you iterate over the dataframe in its native format.
Note that I went ahead and ran some timeit tests to see which way is faster, and as it turns out, converting to a numpy array first is about twice as fast as using df in its original DataFrame format.
This was my experimental setup:
import timeit
setup = '''
import pandas as pd
import scipy.stats as stats
import numpy as np
df = pd.read_csv('df.csv', header=0, index_col=0)
'''
theCommand = '''
df = np.array(df)
stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
'''
print(min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000)))
theCommand = '''
stats.friedmanchisquare(*[row for index, row in df.iterrows()])
'''
print(min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000)))
which yields the following results:
4.97029900551
8.7627799511
The problem I see with your first attempt is that you end up passing one list with multiple arrays inside it.
stats.friedmanchisquare needs multiple array_like arguments, not a single list.
Try using the * (star/unpack) operator to unpack the list
Like this
df = df.to_numpy()
print(stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])]))
You could pass it using the "star operator", similar to this:
a = np.array([[1, 2, 3], [2, 3, 4] ,[4, 5, 6]])
friedmanchisquare(*(a[i, :] for i in range(a.shape[0])))
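The unpacking itself is plain Python and is easy to verify without scipy; takes_many below is just a hypothetical stand-in for stats.friedmanchisquare that reports how many separate positional arguments it received:

```python
import numpy as np

def takes_many(*samples):
    # stand-in for stats.friedmanchisquare: counts its
    # positional array arguments
    return len(samples)

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])

# Without the star, the function would see one generator;
# with the star, it sees three separate row arrays.
print(takes_many(*(a[i, :] for i in range(a.shape[0]))))  # 3
```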
My code:
#!/usr/bin/python
with open("a.dat", "r") as ins:
    array = []
    for line in ins:
        l = line.strip()
        array.append(l)
a1 = array[:,1]
print(a1)
I want to read a.dat as an array and then take the first column. What is wrong?
For loading numerical data, it's often useful to use numpy instead of just Python.
import numpy as np
arr = np.loadtxt('a.dat')
print(arr[:, 0])
numpy is a Python library that's very well suited to loading and manipulating numerical data (with a bonus that when used correctly, it's waaaay faster than using Python lists). In addition, for dealing with tabular data with mixed datatypes, I recommend using pandas.
import pandas as pd
df = pd.read_csv('a.dat', sep=' ', names=['col1', 'col2'])
print(df['col1'])
This is wrong: a1 = array[:,1]. Here array is a plain Python list, and comma-separated values in the index make it a tuple of two values, which lists don't support. To get the first row, use:
a1 = array[0]
And to get the first column, use:
column = [row[0] for row in array]
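Note that this assumes each element of array is itself a list of fields (i.e. the lines were split, e.g. with line.split()), not a whole-line string. A small self-contained sketch with made-up data:

```python
# Hypothetical rows after splitting each line of a.dat on whitespace
array = [["1.0", "2.0"],
         ["3.0", "4.0"],
         ["5.0", "6.0"]]

a1 = array[0]                       # first row
column = [row[0] for row in array]  # first column
print(column)  # ['1.0', '3.0', '5.0']
```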