python(numpy): how to read specific columns from CSV file? - python

I have a CSV data file with 100 columns, 100,000 rows, and one header row.
First, I want to make a list containing the 1st, 3rd, and 5th to 100,000th columns of the original CSV data file.
In that case, I think I can use a script like the one below.
import numpy as np
# Load data
xy = np.loadtxt('CSV data.csv', delimiter=',', skiprows=1)
x = xy[:,[1,3,5,6,7,8,9,10,11 .......,100000]]
But, as you know, this is not a good method. It is tedious to type and does not generalize.
At first I thought the script below would work, but it failed.
x = xy[:,[1,3,5:100000]]
How can I build a separate list from specific columns, some individual and some in a continuous range?

np.r_ is a convenience function (actually an object that is indexed with []) that generates an array of indices:
In [76]: np.r_[1,3,5:100]
Out[76]:
array([ 1, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87,
88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
This should be usable for both xy[:,np.r_[...]] and the usecols parameter.
In [78]: np.arange(300).reshape(3,100)[:,np.r_[1,3,5:100:10]]
Out[78]:
array([[ 1, 3, 5, 15, 25, 35, 45, 55, 65, 75, 85, 95],
[101, 103, 105, 115, 125, 135, 145, 155, 165, 175, 185, 195],
[201, 203, 205, 215, 225, 235, 245, 255, 265, 275, 285, 295]])

Just use the usecols parameter of np.loadtxt():
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.loadtxt.html
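As a minimal sketch of that suggestion, the index array from np.r_ can be passed to usecols so that only the wanted columns are parsed. The in-memory CSV text and its column names are hypothetical stand-ins for the real file, and the indices are zero-based.

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV file: one header row, 8 numeric columns.
csv_text = (
    "c0,c1,c2,c3,c4,c5,c6,c7\n"
    "0,1,2,3,4,5,6,7\n"
    "10,11,12,13,14,15,16,17\n"
)

# Zero-based column indices: 1, 3, and 5 up to (not including) 8.
cols = np.r_[1, 3, 5:8]

# Only the listed columns are read; the header row is skipped.
x = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, usecols=cols)
print(x.shape)  # (2, 5)
```

For the real file, the StringIO object would be replaced by the file name and the slice end by the actual number of columns.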

Another option is to define x by removing columns from xy:
x = np.delete(xy, [0,2,4], axis=1)
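A quick check, on a small hypothetical array rather than the real data, suggests that deleting columns 0, 2 and 4 gives the same result as selecting columns 1, 3 and 5 onward with np.r_:

```python
import numpy as np

# Hypothetical 4 x 10 stand-in for the loaded data.
xy = np.arange(40).reshape(4, 10)

# Keep columns 1, 3, and 5 through the last one (zero-based).
kept_by_index = xy[:, np.r_[1, 3, 5:xy.shape[1]]]

# Remove columns 0, 2, and 4 instead.
kept_by_delete = np.delete(xy, [0, 2, 4], axis=1)

same = np.array_equal(kept_by_index, kept_by_delete)
print(same)  # True
```

Which form reads better probably depends on whether the list of kept columns or the list of dropped columns is shorter.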

Related

Need help extracting elements from a list

templineNums = [1,2,3,4,5,6,7,8,9,10,20,21,22,23,24 ... 1000]
splitLines = [24, 36, 28, 30 .. ]
Using the elements in the splitLines list, I need to print that number (the splitLines element) of templineNums entries.
For example, the first element of splitLines is 24, which means I need to print the first 24 elements of templineNums. Then I need to keep looping through the elements of splitLines. The next element is 36, which means I need the next 36 elements starting from where I left off (i.e. [24:60], since the end index is exclusive). I need to keep doing this until I have looped through all of the elements in splitLines and printed that number of elements from templineNums.
How may I do this? Note: There will not be an out-of-bounds error because templineNums has more than 1000 numbers in the list and splitLines will not go past that.
Thanks!
Here is an answer to your question. It is a good question; I don't know why others didn't like it.
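templineNums = list(range(1,1000))
splitLines = [24, 36, 28, 30 ]
last=0  # index where the previous chunk ended
for i in splitLines:
    # print the next i elements, then advance the start position
    print(templineNums[last:last+i])
    last+=i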
Output:
In [9]: templineNums = list(range(1,1000))
   ...: splitLines = [24, 36, 28, 30 ]
   ...:
   ...: last=0
   ...: for i in splitLines:
   ...:     print(templineNums[last:last+i])
   ...:     last+=i
   ...:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
[25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]
[61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88]
[89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118]
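The same running-offset idea can be sketched with itertools.accumulate, which computes the chunk end positions up front. The helper name split_by_lengths is hypothetical and not part of the original answer.

```python
from itertools import accumulate

def split_by_lengths(items, lengths):
    """Split items into consecutive chunks whose sizes are given by lengths."""
    ends = list(accumulate(lengths))   # running totals: 24, 60, 88, 118
    starts = [0] + ends[:-1]           # each chunk starts where the last ended
    return [items[s:e] for s, e in zip(starts, ends)]

templineNums = list(range(1, 1000))
splitLines = [24, 36, 28, 30]

chunks = split_by_lengths(templineNums, splitLines)
for chunk in chunks:
    print(chunk[0], chunk[-1], len(chunk))
```

Returning the chunks instead of printing them inside the loop makes it easier to reuse the result later.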

Adding each value in an RDD to its partition number

Quite new to PySpark so this might be simple. I have an RDD that ranges from 0 to 99 and has 4 partitions.
A = sc.parallelize(range(100), 4)
And I have to find a way to return another RDD where each value in the RDD is added to its partition number. The ideal example would be:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 52, 53, 54, 55, 56, 57, 58, 59,
60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,
99, 100, 101, 102]
Would like to know how I could amend the following code to get the desired results.
A = sc.parallelize(range(100), 4)
B =
print(B.collect())
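No answer is recorded for this question here. One possible approach, offered as a sketch rather than a confirmed solution, is RDD.mapPartitionsWithIndex, which calls a function with the partition index and an iterator over that partition's elements. The function name add_partition_index is hypothetical, and the plain-Python simulation below assumes the 100 values are split into 4 equal partitions of 25, as the expected output implies.

```python
def add_partition_index(index, iterator):
    """Yield each element plus the index of the partition it belongs to."""
    for value in iterator:
        yield value + index

# With a live SparkContext `sc`, the call would look like this:
#   A = sc.parallelize(range(100), 4)
#   B = A.mapPartitionsWithIndex(add_partition_index)
#   print(B.collect())

# Plain-Python simulation (assumption: partitions 0-24, 25-49, 50-74, 75-99).
partitions = [range(start, start + 25) for start in range(0, 100, 25)]
simulated = [
    value
    for index, part in enumerate(partitions)
    for value in add_partition_index(index, part)
]
print(simulated)
```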

Is it possible to make a list of numbers from 1 to n using a loop in python 3.6, then print it?

I am creating a board game with Python, and I need to use numbers as coordinates. There are 121 coordinates, and I want to make a list of them. But I don't want to type all the numbers from 1 to 121. I found this code:
coordinates = [range(1,121)]
I tried to print it, but it didn't return the numbers from 1 to 121, just the range object itself: range(1,121). What happened?
You can try this:
coordinates=[]
for i in range(1,122):
    coordinates.append(i)
print(coordinates)
Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121]
range does not generate a list or tuple. Because the stop value is excluded, use 122 to include 121:
coordinates = list(range(1,122))
The description in the Python docs:
The advantage of the range type over a regular list or tuple is that a
range object will always take the same (small) amount of memory, no
matter the size of the range it represents (as it only stores the
start, stop and step values, calculating individual items and
subranges as needed).
The solution you have found will work for Python-2.x only.
You can achieve the same in many ways as described below:
Using list comprehension:
[i for i in range(1,121)]
Unpacking range with *:
[*range(1,121)]
Using numpy:
import numpy
numpy.arange(1, 121)
Note:
You mentioned in your question that you want the numbers 1 to 121 (1 to n). range excludes its stop value, so range(1, 121) ends at 120; to include the nth value, pass n + 1 as the stop value, for example range(1, 122).
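A short sketch of the difference between wrapping a range in brackets and converting it to a list, using n = 121 from the question:

```python
n = 121

# Brackets build a one-element list that holds the range object itself.
wrapped = [range(1, n + 1)]
print(wrapped)  # prints [range(1, 122)]

# list() consumes the range and stores every number.
coordinates = list(range(1, n + 1))
print(len(coordinates), coordinates[0], coordinates[-1])  # prints 121 1 121
```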

How to select specific column ranges in pandas? [duplicate]

I have to read several files some in Excel format and some in CSV format. Some of the files have hundreds of columns.
Is there a way to select several ranges of columns without specifying all the column names or positions? For example something like selecting columns 1 -10, 15, 17 and 50-100:
df = df.ix[1:10, 15, 17, 50:100]
I need to know how to do this both when creating a DataFrame from Excel and CSV files and after the DataFrame is created.
Use np.r_:
np.r_[1:10, 15, 17, 50:100]
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 15, 17, 50, 51, 52, 53, 54, 55,
56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
so you can do
df.iloc[:, np.r_[1:10, 15, 17, 50:100]]
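A minimal sketch covering both cases from the question, selecting after the DataFrame exists and selecting while reading. The six-column CSV text is a hypothetical stand-in; the positions are zero-based, and the index array is converted to a plain list before being passed to usecols.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical stand-in for a wide CSV file.
csv_text = "a,b,c,d,e,f\n0,1,2,3,4,5\n10,11,12,13,14,15\n"

# Zero-based positions: column 0, columns 2-3, and column 5.
cols = np.r_[0, 2:4, 5]

# Select after the DataFrame has been created.
df = pd.read_csv(io.StringIO(csv_text))
subset = df.iloc[:, cols]

# Select while reading, so the other columns are never loaded.
subset_on_read = pd.read_csv(io.StringIO(csv_text), usecols=cols.tolist())

print(list(subset.columns))  # ['a', 'c', 'd', 'f']
```

pd.read_excel also takes a usecols argument, so the same list of positions should carry over to Excel files.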
I find @piRSquared's answer straightforward.
You may also use:
Locs = list(range(0,10)) + [14, 16] + list(range(49, 100))
# columns 1 -10, 15, 17 and 50-100
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 14, 16, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
df = df.iloc[:, Locs]
Use an inner join, like:
result = pd.concat([df1, df4], axis=1, join="inner")

How to turn a boolean array into index array in numpy

Is there an efficient NumPy mechanism to retrieve the integer indexes of the locations in an array where a condition is true, as opposed to the Boolean mask array?
For example:
import numpy as np
x = np.array(range(100, 1, -1))
# Generate a mask to find all values that are a power of 2
mask = (x & (x - 1)) == 0
# This will tell me those values
print(x[mask])
In this case, I'd like to know the indexes i of mask where mask[i]==True. Is it possible to generate these without looping?
Another option:
In [13]: numpy.where(mask)
Out[13]: (array([36, 68, 84, 92, 96, 98]),)
which is the same thing as numpy.where(mask==True).
You should be able to use numpy.nonzero() to find this information.
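A short sketch comparing the options mentioned so far on the array from the question. np.where(mask) and np.nonzero(mask) both return a tuple with one index array per dimension, while np.flatnonzero returns the index array directly.

```python
import numpy as np

x = np.arange(100, 1, -1)
mask = (x & (x - 1)) == 0   # True where the value is a power of 2

idx_where = np.where(mask)[0]      # first element of the returned tuple
idx_nonzero = np.nonzero(mask)[0]  # same result as np.where(mask)
idx_flat = np.flatnonzero(mask)    # plain index array, no tuple

print(idx_flat.tolist())     # [36, 68, 84, 92, 96, 98]
print(x[idx_flat].tolist())  # [64, 32, 16, 8, 4, 2]
```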
If you prefer the indexer way, you can convert your boolean list to a numpy array:
print(x[np.array(mask)])
np.arange(100,1,-1)
array([100, 99, 98, 97, 96, 95, 94, 93, 92, 91, 90, 89, 88,
87, 86, 85, 84, 83, 82, 81, 80, 79, 78, 77, 76, 75,
74, 73, 72, 71, 70, 69, 68, 67, 66, 65, 64, 63, 62,
61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49,
48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36,
35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23,
22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10,
9, 8, 7, 6, 5, 4, 3, 2])
x=np.arange(100,1,-1)
np.where(x&(x-1) == 0)
(array([36, 68, 84, 92, 96, 98]),)
To get the values instead of the indexes, rephrase this as:
x[x&(x-1) == 0]
