I have to read several files, some in Excel format and some in CSV format. Some of the files have hundreds of columns.
Is there a way to select several ranges of columns without specifying all the column names or positions? For example, something like selecting columns 1-10, 15, 17 and 50-100:
df = df.ix[1:10, 15, 17, 50:100]
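I need to know how to do this both when creating the DataFrame from Excel and CSV files and after the DataFrame has been created.
Use np.r_ to build the array of column positions. The slices follow the usual Python rules: positions are 0-based and the end of each slice is excluded.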
np.r_[1:10, 15, 17, 50:100]
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 15, 17, 50, 51, 52, 53, 54, 55,
56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
So you can do:
df.iloc[:, np.r_[1:10, 15, 17, 50:100]]
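For the reading step in the question, the same positions can probably be passed to the usecols argument of pd.read_csv (pd.read_excel has a usecols argument that also accepts a list of integer positions). A minimal sketch, assuming a small made-up in-memory CSV with 12 columns named c0 to c11 in place of the real file:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real file: 12 columns (c0..c11), 2 data rows.
header = ",".join(f"c{i}" for i in range(12))
rows = [",".join(str(r * 100 + i) for i in range(12)) for r in range(2)]
csv_text = "\n".join([header] + rows) + "\n"

# 0-based positions; the end of each slice is excluded.
positions = np.r_[0:3, 5, 8:10].tolist()

# Select the columns while reading ...
df_read = pd.read_csv(io.StringIO(csv_text), usecols=positions)

# ... or read everything and select afterwards.
df_all = pd.read_csv(io.StringIO(csv_text))
df_after = df_all.iloc[:, positions]

print(list(df_read.columns))
```

Converting the array with .tolist() is a cautious choice here so that usecols receives plain Python integers.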
I find @piRSquared's answer straightforward.
You may also use:
Locs = list(range(0,10)) + [14, 16] + list(range(49, 100))
# columns 1 -10, 15, 17 and 50-100
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 14, 16, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
df = df.iloc[:, Locs]
Use an inner join, for example:
result = pd.concat([df1, df4], axis=1, join="inner")
Quite new to PySpark so this might be simple. I have an RDD that holds the integers 0 to 99 and has 4 partitions.
A = sc.parallelize(range(100), 4)
And I have to find a way to return another RDD where each value in the RDD is added to its partition number. The ideal example would be:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 52, 53, 54, 55, 56, 57, 58, 59,
60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,
99, 100, 101, 102]
I would like to know how to amend the following code to get the desired result.
A = sc.parallelize(range(100), 4)
B =
print(B.collect())
I am completely stuck. In my original DataFrame I have one column of interest (fluorescence), and I want to take a fixed number of elements (3, marked in yellow) at a fixed interval (5) and average them. The output should be saved into NewList.
fluorescence = df.iloc[1:20, 0]
fluorescence=pd.to_numeric(fluorescence)
## add a list to count
fluorescence['time']= list(range(1,20,1))
## create a list with interval
interval = list(range(1, 20, 5))
NewList=[]
for i in range(len(fluorescence)):
    if fluorescence['time'][i] == interval[i]:
        NewList.append(fluorescence[fluorescence.tail(3).mean()])
print(NewList)
Any input is welcome!!
Thank you in advance
Here I take a subset of the DataFrame for every 5 consecutive rows and compute the mean of the last 3 rows of each subset:
import pandas as pd
fluorescence=pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
NewList=[]
j=0
for i1 in range(4,len(fluorescence),5):
    NewList.append(fluorescence.loc[j:i1,0].tail(3).mean())
    j=i1
print(NewList)
If you have a list of data and you want to grab 3 entries out of every 5 you can segment your list as follows:
from statistics import mean
data = [63, 64, 43, 91, 44, 84, 14, 43, 87, 53, 81, 98, 34, 33, 60, 82, 86, 6, 81, 96, 99, 10, 76, 73, 63, 89, 70, 29, 32, 3, 98, 52, 37, 8, 2, 80, 50, 99, 71, 5, 7, 35, 56, 47, 40, 2, 8, 56, 69, 15, 76, 52, 24, 56, 89, 52, 30, 70, 68, 71, 17, 4, 39, 39, 85, 29, 18, 71, 92, 8, 1, 95, 52, 94, 71, 88, 59, 64, 100, 96, 65, 15, 89, 19, 63, 38, 50, 65, 52, 26, 46, 79, 85, 32, 12, 67, 35, 22, 54, 81]
new_data = []
for i in range(0, len(data), 5):
    every_five = data[i:i+5]
    three_out_of_five = every_five[2:5]
    new_data.append(mean(three_out_of_five))
print(new_data)
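If the length of the data is a multiple of 5, the same grouping can also be done without an explicit loop by reshaping into rows of 5. This is a NumPy alternative to the loop above rather than the same method, and the sample values 1 to 15 are made up for illustration:

```python
import numpy as np

data = np.arange(1, 16)             # 15 made-up sample values
blocks = data.reshape(-1, 5)        # one row per block of 5 values
means = blocks[:, 2:].mean(axis=1)  # average the last 3 values of each block
print(means.tolist())
```

The reshape raises a ValueError when the length is not divisible by 5, so trailing values would have to be trimmed or handled separately first.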
I have a CSV data file with 100 columns, 100,000 rows, and one header line.
First, I want to make a list containing the 1st, 3rd, and 5th to 100,000th columns of the original CSV data file.
In that case, I think I can use a script like the one below.
#Load data
xy = np.loadtxt('CSV data.csv', delimiter=',', skiprows=1)
x = xy[:,[1,3,5,6,7,8,9,10,11 .......,100000]]
But, as you know, this is not a good method. It is tedious to type and it does not generalize.
At first I thought the script below could be used, but it failed.
x = xy[:,[1,3,5:100000]]
How can I build a separate array from specific columns, some isolated and some in a continuous range?
np.r_ is a convenience function (actually an object that is indexed with []) that generates an array of indices:
In [76]: np.r_[1,3,5:100]
Out[76]:
array([ 1, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87,
88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
This should be usable for both xy[:,np.r_[...]] and the usecols parameter.
In [78]: np.arange(300).reshape(3,100)[:,np.r_[1,3,5:100:10]]
Out[78]:
array([[ 1, 3, 5, 15, 25, 35, 45, 55, 65, 75, 85, 95],
[101, 103, 105, 115, 125, 135, 145, 155, 165, 175, 185, 195],
[201, 203, 205, 215, 225, 235, 245, 255, 265, 275, 285, 295]])
Just use the usecols parameter of np.loadtxt():
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.loadtxt.html
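As a rough sketch of combining the two suggestions, the np.r_ result can be passed as usecols so that only the wanted columns are parsed. The in-memory text below is a made-up stand-in for 'CSV data.csv' with 8 columns:

```python
import io

import numpy as np

# Hypothetical stand-in for the real file: one header line, 3 rows, 8 columns.
header = ",".join(f"col{i}" for i in range(8))
rows = [",".join(str(r * 10 + c) for c in range(8)) for r in range(3)]
text = "\n".join([header] + rows)

# Columns 1, 3 and 5..7 (0-based, slice end excluded).
cols = np.r_[1, 3, 5:8].tolist()

x = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1, usecols=cols)
print(x.shape)
```

Selecting at load time avoids holding the unwanted columns in memory, which may matter for a file with 100,000 rows.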
Another option is to define x by removing columns from xy:
x = np.delete(xy, [0,2,4], axis=1)
What is the most efficient and reliable way in Python to split sectors up like this:
number: 101 (may vary of course)
chunk1: 1 to 30
chunk2: 31 to 61
chunk3: 62 to 92
chunk4: 93 to 101
Flow:
copy sectors 1 to 30
skip sectors in chunk 1 and copy 30 sectors starting from sector 31.
and so on...
I have this solved in a "manual" way using modules and basic math but there's got to be a function for this?
Thank you.
I assume your numbers are in a list. If you know the chunk size at which the sequence should be split, slicing by index is the most direct approach, because each chunk is taken by position instead of by searching the list. You can wrap this in a small function and reuse it, something like below:
import math

def sectors(num_seq, chunk_size=30):
    # Number of chunks, rounded up so the remainder gets its own chunk.
    n_chunks = math.ceil(len(num_seq) / chunk_size)
    for i in range(n_chunks):
        # Every chunk has chunk_size items except possibly the last one;
        # slicing past the end of the list simply returns the remainder.
        print(num_seq[chunk_size * i:chunk_size * (i + 1)])
Now you can reuse it whenever you need the same thing, and it is efficient because each chunk is taken by index instead of by searching through the list.
Here is the output:
x = list(range(1, 101))
sectors(x)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
[31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]
[61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]
[91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
Please let me know if this meets your requirement.
Easy and fast (single iteration):
>>> input = list(range(1, 102))
>>> n = 30
>>> output = [input[i:i+n] for i in range(0, len(input), n)]
>>> output
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60], [61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90], [91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101]]
Another very simple way, using a list comprehension:
>>> f = lambda x, y: [x[i:i+y] for i in range(0, len(x), y)]
>>> f(list(range(1, 102)), 30)
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60], [61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90], [91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101]]
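If only the first and last sector of each chunk are needed for the copy loop, the boundaries can be computed directly instead of building lists of every sector. A minimal sketch; sector_ranges is a made-up helper name, and it assumes sectors are numbered from 1:

```python
def sector_ranges(total, chunk_size=30):
    """Yield (first, last) sector numbers, 1-based and inclusive."""
    for first in range(1, total + 1, chunk_size):
        # The last chunk is cut short when total is not a multiple of chunk_size.
        last = min(first + chunk_size - 1, total)
        yield first, last


ranges = list(sector_ranges(101, 30))
print(ranges)
```

Because it is a generator, each pair is produced only when the copy loop asks for it.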
You can try using numpy.histogram if you're looking to split a number into equal-sized bins (sectors).
It returns the count in each bin together with an array of numbers demarcating the bin boundaries:
import numpy as np
number = 101
values = np.arange(number, dtype=int)
bins = np.histogram(values, bins='auto')
print(bins)
Is there an efficient NumPy mechanism to retrieve the integer indices of the locations in an array where a condition is true, as opposed to the Boolean mask array?
For example:
x=np.array(range(100,1,-1))
#generate a mask to find all values that are a power of 2
mask=x&(x-1)==0
#This will tell me those values
print(x[mask])
In this case, I'd like to know the indexes i of mask where mask[i]==True. Is it possible to generate these without looping?
Another option:
In [13]: numpy.where(mask)
Out[13]: (array([36, 68, 84, 92, 96, 98]),)
which is the same thing as numpy.where(mask==True).
You should be able to use numpy.nonzero() to find this information.
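For instance, assuming x is the 1-D array np.arange(100, 1, -1) used elsewhere in this thread, np.nonzero returns a tuple with one index array per dimension, while np.flatnonzero returns the indices of the flattened array directly:

```python
import numpy as np

x = np.arange(100, 1, -1)
mask = (x & (x - 1)) == 0       # True where x is a power of 2

idx_tuple = np.nonzero(mask)    # tuple with one index array per dimension
idx = np.flatnonzero(mask)      # plain 1-D array of indices

print(idx.tolist())
print(x[idx].tolist())
```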
If you prefer the indexer way, you can convert your boolean list to a NumPy array:
print(x[np.array(mask)])
np.arange(100,1,-1)
array([100, 99, 98, 97, 96, 95, 94, 93, 92, 91, 90, 89, 88,
87, 86, 85, 84, 83, 82, 81, 80, 79, 78, 77, 76, 75,
74, 73, 72, 71, 70, 69, 68, 67, 66, 65, 64, 63, 62,
61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49,
48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36,
35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23,
22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10,
9, 8, 7, 6, 5, 4, 3, 2])
x=np.arange(100,1,-1)
np.where(x&(x-1) == 0)
(array([36, 68, 84, 92, 96, 98]),)
To get the matching values instead of their indices, index with the mask directly:
x[x&(x-1) == 0]