I am trying to write a function that reads an array of parquet files as DataFrames and concatenates them sequentially into one file.
# Get the list of all files and directories
path = "C:**********\\main_data_1"
dir_list = os.listdir(path)
dir_list
Total_df = []
all_ais_msg_df = []

for j in np.arange(1,len(dir_list)+1):
    filename = 'ais_comb_'+ str(j)+ '.parquet'
    j +=1
    Total_df.append(filename)

for i in Total_df:
    df = pd.read_parquet('main_data_1/'+ str(i))  # having problem here
    i+=1
    all_ais_msg_df = pd.DataFrame(all_ais_msg_df.append(df))
I intend to construct a string like 'main_data_1/ais_comb_1.parquet', using i in the second loop to increment the ais_comb_1, ais_comb_2, ... part of the string up to the last filename, so that I can read and combine all the files into one DataFrame.
The error I get is:
TypeError: can only concatenate str (not "int") to str
I crave a very clear explanation please, I am still a rookie.
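For reference, here is a minimal sketch of what the two loops seem to be aiming for, assuming the folder contains files named ais_comb_1.parquet up to ais_comb_N.parquet. The key points are that the numeric part has to be converted with str() before it is joined to the rest of the filename (adding an int to a str is exactly what raises this TypeError), and that pd.concat is the usual way to combine the per-file DataFrames:

import os
import pandas as pd

path = "main_data_1"

frames = []
for j in range(1, len(os.listdir(path)) + 1):
    # build 'main_data_1/ais_comb_<j>.parquet'; str(j) avoids mixing int and str
    filename = os.path.join(path, 'ais_comb_' + str(j) + '.parquet')
    frames.append(pd.read_parquet(filename))

# stack all per-file DataFrames into a single one, renumbering the index
all_ais_msg_df = pd.concat(frames, ignore_index=True)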
I am trying to read in a series of images from a folder using Python. The images are small pieces of a larger image that has been split into a grid, indexed as "imagename_row_column.jpg" (i.e. img_0_1.jpg). My current code (pasted below) has trouble with the column index and reads numbers 10 and above in the incorrect order. For example, instead of reading in like (img_0, img_0_1, img_0_2, ... img_0_9, img_0_10, ...) I am getting (img_0, img_0_10, img_0_11, img_0_1, img_0_2, ...). Any advice would be much appreciated. Thanks!
# Get images from folder
path1 = r'C:\Users\user_\Desktop\Test\IMG_Scan'
images = []
mylist = os.listdir(path1)
for img in mylist:
    curimg = cv2.imread(f'{path1}/{img}')
    images.append(curimg)
"img_0_10" < "img_0_2" by string comparison.
Try custom sorting your files before iterating:
mylist = sorted(os.listdir(path1), key=lambda x: list(map(int, x.split("_")[1:])))
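Note that if the listed names still carry the .jpg extension (as in the question), the last piece after splitting is something like "10.jpg" and int() will fail on it. A small sketch of one way around that, stripping the extension before splitting, would be:

import os

mylist = sorted(
    os.listdir(path1),
    key=lambda x: list(map(int, os.path.splitext(x)[0].split("_")[1:]))
)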
I have extremely large files; each file is almost 2 GB. I would therefore like to process multiple files in parallel, and I can do that because all of the files have a similar format, so each file can be read independently. I know I should use the multiprocessing library, but I am really confused about how to use it with my code.
My code for file reading is:
def file_reading(file,num_of_sample,segsites,positions,snp_matrix):
    with open(file,buffering=2000009999) as f:
        ###I read file here. I am not putting that code here.
        try:
            assert len(snp_matrix) == len(positions)
            return positions,snp_matrix ## return statement
        except:
            print('length of snp matrix and length of position vector not the same.')
            sys.exit(1)
My main function is:
if __name__ == "__main__":
segsites = []
positions = []
snp_matrix = []
path_to_directory = '/dataset/example/'
extension = '*.msOut'
num_of_samples = 162
filename = glob.glob(path_to_directory+extension)
###How can I use multiprocessing with function file_reading
number_of_workers = 10
x,y,z = [],[],[]
array_of_number_tuple = [(filename[file], segsites,positions,snp_matrix) for file in range(len(filename))]
with multiprocessing.Pool(number_of_workers) as p:
pos,snp = p.map(file_reading,array_of_number_tuple)
x.extend(pos)
y.extend(snp)
So my input to the function is as follows:
file - list containing filenames
num_of_samples - int value
segsites - initially an empty list to which I want to append as I am reading the file.
positions - initially an empty list to which I want to append as I am reading the file.
snp_matrix - initially an empty list to which I want to append as I am reading the file.
The function returns the positions list and the snp_matrix list at the end. How can I use multiprocessing for this when my arguments are lists and an integer? The way I've used multiprocessing gives me the following error:
TypeError: file_reading() missing 3 required positional arguments: 'segsites', 'positions', and 'snp_matrix'
The elements in the list that is being passed to the Pool.map are not automatically unpacked. You can generally only have one argument in your 'file_reading' function.
Of course, this argument can be a tuple, so it is no problem to unpack it yourself:
def file_reading(args):
    file, num_of_sample, segsites, positions, snp_matrix = args
    with open(file,buffering=2000009999) as f:
        ###I read file here. I am not putting that code here.
        try:
            assert len(snp_matrix) == len(positions)
            return positions,snp_matrix ## return statement
        except:
            print('length of snp matrix and length of position vector not the same.')
            sys.exit(1)
if __name__ == "__main__":
segsites = []
positions = []
snp_matrix = []
path_to_directory = '/dataset/example/'
extension = '*.msOut'
num_of_samples = 162
filename = glob.glob(path_to_directory+extension)
number_of_workers = 10
x,y,z = [],[],[]
array_of_number_tuple = [(filename[file], num_of_samples, segsites,positions,snp_matrix) for file in range(len(filename))]
with multiprocessing.Pool(number_of_workers) as p:
pos,snp = p.map(file_reading,array_of_number_tuple)
x.extend(pos)
y.extend(snp)
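As an alternative, here is a short sketch using Pool.starmap (available since Python 3.3), which unpacks each tuple into separate arguments so file_reading can keep its original five-parameter signature:

with multiprocessing.Pool(number_of_workers) as p:
    # each tuple is unpacked into (file, num_of_sample, segsites, positions, snp_matrix)
    results = p.starmap(file_reading, array_of_number_tuple)

for pos, snp in results:
    x.extend(pos)
    y.extend(snp)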
Here, my code fetches values from text files and builds matrices as a multidimensional array, but the problem is that the code creates an array with more than two dimensions, which I can't manipulate. I need a two-dimensional array; how do I do that?
Explanation of my code's algorithm:
Goal of the code:
My code fetches values from a specific folder. Each folder contains 7 txt files generated from one user, so multiple folders contain the data of multiple users.
Step 1: Start the 1st for loop and control it by how many folders exist in the parent folder; the variable 'path' stores the path of the current folder.
Step 2: Open that path and fetch the data of the 7 txt files using a 2nd for loop. After fetching, the 2nd for loop ends and the rest of the code executes.
Step 3: Concatenate the data of the 7 txt files into one 1-D array.
Step 4 (here the problem arises): Store the 1-D array of each folder as a row of a 2-D array. End the 1st for loop.
Code:
import numpy as np
from array import *
import os

f_path='Result'
array_control_var=0

# fetch each directory path
for (path,dirs,file) in os.walk(f_path):
    if(path==f_path):
        continue
    f_path_1= path +'\page_1.txt'
    # Get data from page_1 individually because it contains string-type data
    pgno_1 = np.array(np.loadtxt(f_path_1, dtype='U', delimiter=','))
    # only for page_2.txt
    f_path_2= path +'\page_2.txt'
    with open(f_path_2) as f:
        str_arr = ','.join([l.strip() for l in f])
    pgno_2 = np.asarray(str_arr.split(','), dtype=int)
    # using a loop, fetch data from the remaining text files (data type = int)
    for j in range(3,8):
        # store the file path in a variable
        txt_file_path=path+'\page_'+str(j)+'.txt'
        if os.path.exists(txt_file_path)==True:
            # generate a variable name that auto-increments with the for loop
            foo='pgno_'+str(j)
        else:
            break
        # pass the variable name as a string and store the value
        exec(foo + " = np.array(np.loadtxt(txt_file_path, dtype='i', delimiter=','))")
    #z=np.array([pgno_2,pgno_3,pgno_4,pgno_5,pgno_6,pgno_7])
    # merge all arrays from page 2 onwards into a single one-dimensional array
    f_array=np.concatenate((pgno_2,pgno_3,pgno_4,pgno_5,pgno_6,pgno_7), axis=0)
    # on the first pass of the loop, assign this value
    if array_control_var==0:
        main_f_array=f_array
    else:
        # here the problem arises
        main_f_array=np.array([main_f_array,f_array])
    array_control_var+=1
print(main_f_array)
Currently my code generates an array like this (for 3 folders):
[
array([[0,0,0],[0,0,0]]),
array([0,0,0])
]
Note: I don't know how many dimensions it has.
But I want:
[
array(
[0,0,0]
[0,0,0]
[0,0,0])
]
I tried to write recursive code that flattens the list of lists into one list. It gives the desired output for your case, but I did not try it on many other inputs (and it is buggy for certain cases, such as list = [0,[[0,0,0],[0,0,0]],[0,0,0]])...
flat = []

def main():
    list =[[[0,0,0],[0,0,0]],[0,0,0]]
    recFlat(list)
    print(flat)

def recFlat(Lists):
    if len(Lists) == 0:
        return Lists
    head, tail = Lists[0], Lists[1:]
    if isinstance(head, (list,)):
        recFlat(head)
        return recFlat(tail)
    else:
        return flat.append(Lists)

if __name__ == '__main__':
    main()
My idea behind the code was to traverse the head of each list, and check whether it is an instance of a list or an element. If the head is an element, this means I have a flat list and I can return the list. Else, I should recursively traverse more.
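For the original array-building problem itself, a simpler pattern (a sketch, assuming every folder yields an f_array of the same length) is to collect each folder's 1-D array in a plain Python list inside the loop and stack once at the end with np.vstack, which avoids the growing nested structure:

import numpy as np

rows = []
for f_array in ([0, 0, 0], [0, 0, 0], [0, 0, 0]):  # stand-ins for each folder's concatenated data
    rows.append(np.asarray(f_array))

# one stacking step after the loop gives a true 2-D array: (number_of_folders, row_length)
main_f_array = np.vstack(rows)
print(main_f_array)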
I'm using Python 3.5 to move through directories and subdirectories to access csv files and fill arrays with data from those files. The first csv file the code encounters looks like this:
The code I have is below:
import matplotlib.pyplot as plt
import numpy as np
import os, csv, datetime, time, glob

gpheight = []
RH = []
dewpt = []
temp = []
windspd = []
winddir = []

dirpath, dirnames, filenames = next(os.walk('/strm1/serino/DATA'))
count2 = 0
for dirname in dirnames:
    if len(dirname) >= 8:
        try:
            dt = datetime.datetime.strptime(dirname[:8], '%m%d%Y')
            csv_folder = os.path.join(dirpath, dirname)
            for csv_file2 in glob.glob(os.path.join(csv_folder, 'figs', '0*.csv')):
                if os.stat(csv_file2).st_size == 0:
                    continue
                # create new arrays for each case
                gpheight.append([])
                RH.append([])
                temp.append([])
                dewpt.append([])
                windspd.append([])
                winddir.append([])
                with open(csv_file2, newline='') as f2_input:
                    csv_input2 = csv.reader(f2_input,delimiter=' ')
                    for j,row2 in enumerate(csv_input2):
                        if j == 0:
                            continue  # skip header row
                        # fill arrays created above
                        windspd[count2].append(float(row2[5]))
                        winddir[count2].append(float(row2[6]))
                        gpheight[count2].append(float(row2[1]))
                        RH[count2].append(float(row2[4]))
                        temp[count2].append(float(row2[2]))
                        dewpt[count2].append(float(row2[3]))
                count2 = count2 + 1
        except ValueError as e:
            pass
I have it set up to create a new array for each new csv file. However, when I print the third (temperature) column,
for n in range(0,len(temp)):
    print(temp[0][n])
it only partially prints that column of data:
-70.949997
-68.149994
-60.449997
-63.649994
-57.449997
-51.049988
-45.349991
-40.249985
-35.549988
-31.249985
-27.149994
-24.549988
-22.149994
-19.449997
-16.349976
-13.25
-11.049988
-8.949982
-6.75
-4.449982
-2.25
-0.049988
In addition, I believe a related problem is that when I simply do,
print(temp)
it prints output where the highlighted section is the part that belongs to this one csv file and should therefore all be in one array. There are also additional empty arrays at the end that should not be there.
I have (not shown) a section of code before this that does the same thing but with different csv files, and that works as expected, separating each file's data into a new array, with no empty arrays. I appreciate any help!
The issue turned out to be my use of try and pass. All the files that matched my criteria were found, but some of those files had issues with how their contents were read, and that is what caused the errors I saw later in the code. For anyone looking to use try and pass: make sure you can safely ignore any exception that block of code may raise. Otherwise, it can cause problems later. You may still get an error if you don't pass on it, but that will force you to fix it properly instead of ignoring it.
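As a small sketch of the alternative pattern (with a hypothetical read_column helper, not from the original code): report the exception for the offending file and row instead of silently passing, so bad rows are visible while the rest still load:

import csv

def read_column(csv_path, column):
    """Return one column of a space-delimited csv file as floats, reporting bad rows."""
    values = []
    with open(csv_path, newline='') as f:
        for j, row in enumerate(csv.reader(f, delimiter=' ')):
            if j == 0:
                continue  # skip header row
            try:
                values.append(float(row[column]))
            except (ValueError, IndexError) as e:
                # surface the problem instead of hiding it with a bare pass
                print('Problem in {} at line {}: {}'.format(csv_path, j + 1, e))
    return values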
I have a large number of .asc files containing (x,y) coordinates for two given satellites. There are approximately 3,000 separate files for each satellite (e.g. Satellite1 = [file1,file2,..., file3000] and Satellite2= [file1,file2,..., file3000]).
I'm trying to write some code in Python (version 2.7.8 | Anaconda 2.0) that finds the multiple points on the Earth's surface where both satellite tracks cross over.
I've written some basic code that takes two files as input (i.e. one from Sat1 and one from Sat2) using loadtxt. In a nutshell, the code looks like this:
sat1_in = loadtxt("sat1_file1.asc", usecols = (1,2), comments = "#")
sat2_in = loadtxt("sat2_file1.asc", usecols = (1,2), comments = "#")

def main():
    xover_search()   # Returns True or False depending on whether a crossover is found.
    xover_final()    # Returns the (x,y) coordinates of the crossover.
    write_output()   # Appends these coordinates to a txt file for later display.

if __name__ == "__main__":
    main()
I would like to implement this code to the whole dataset, using a function that outputs "sat1_in" and "sat2_in" for all possible combinations of files between satellite 1 and satellite 2. These are my ideas so far:
#Create two empty lists to store all the files to process for Sat1 and Sat2:
sat1_files = []
sat2_files = []

#Use os.walk to fill each list with the respective file paths:
for root, dirs, filenames in os.walk('.'):
    for filename in fnmatch.filter(filenames, 'sat1*.asc'):
        sat1_files.append(os.path.join(root, filename))

for root, dirs, filenames in os.walk('.'):
    for filename in fnmatch.filter(filenames, 'sat2*.asc'):
        sat2_files.append(os.path.join(root, filename))

#Calculate all possible combinations between both lists using itertools.product:
iter_file = list(itertools.product(sat1_files, sat2_files))

#Extract two lists of files for sat1 and sat2 to be compared each iteration:
sat1_ordered = [seq[0] for seq in iter_file]
sat2_ordered = [seq[1] for seq in iter_file]
And this is where I get stuck. How do I iterate through "sat1_ordered" and "sat2_ordered", using loadtxt to extract the coordinates of every single file? The only thing I have tried is:
for file in sat1_ordered:
    sat1_in = np.loadtxt(file, usecols = (1,2),comments = "#")
But this will create a huge list containing all the measurements for satellite 1.
Could someone give me some ideas about how to tackle this problem?
Maybe you are looking for something like this:
for file1, file2 in iter_file:
    sat1_in = np.loadtxt(file1, usecols = (1,2),comments = "#")
    sat2_in = np.loadtxt(file2, usecols = (1,2),comments = "#")
    ....
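Tying that back to the original main(), a sketch could look like the following; the exact signatures of xover_search, xover_final, and write_output are assumed here, since the question does not show them. Iterating over itertools.product directly (without wrapping it in list()) also keeps memory use low for the roughly nine million Sat1/Sat2 pairs:

import itertools
import numpy as np

for file1, file2 in itertools.product(sat1_files, sat2_files):
    sat1_in = np.loadtxt(file1, usecols=(1, 2), comments="#")
    sat2_in = np.loadtxt(file2, usecols=(1, 2), comments="#")
    # hypothetical call pattern: pass the two coordinate arrays to the existing functions
    if xover_search(sat1_in, sat2_in):
        write_output(xover_final(sat1_in, sat2_in))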