How to merge multiple CSV files with different headers? - Python

I have some image datasets that I want to convert to a CSV file using np.savetxt, but I couldn't find a way to combine them into one CSV file. When I combine the dataset vectors with np.array, the result looks like this: [image]. And when I try to merge multiple CSV files, they end up under the same header even though their header names differ, which is not what I want. Is there any way to combine them, or to save them as a single file with np.savetxt?
(By the way, really sorry for my English and my question, I'm new to Stack Overflow.)
For example, I have these two CSV files ([image], [image]) and I want something like this: [image] (but for multiple files).
Here is my code:
while x != y:
    img = Image.open(f"0_resized/{x}.jpg").convert("L")
    arr = np.array(img)
    shape = arr.shape
    flat_arr = arr.ravel()
    np.savetxt(f"{x}.csv", flat_arr, fmt="%d")
    x += 1

Instead of creating several .csv files and combining them, we can collect the flattened images in a list and save them to a single .csv file. This only takes a few small modifications to your code, as shown below:
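import numpy as np
from PIL import Image

list_arrays = []
while x != y:  # x and y are the image index bounds from the question
    img = Image.open(f"0_resized/{x}.jpg").convert("L")  # load as grayscale
    arr = np.array(img)
    flat_arr = arr.ravel().tolist()  # flatten the image to a 1-D list
    list_arrays.append(flat_arr)
    x += 1
final_arrays = np.asarray(list_arrays)  # shape: (n_images, n_pixels)
# Transpose so that each image becomes one column; fmt="%d" writes integers
# instead of np.savetxt's default scientific notation.
np.savetxt("images.csv", final_arrays.T, delimiter=",", fmt="%d")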
In the code above, we create a list called list_arrays in which we store the flattened arrays produced in the while loop. After reading all the images and storing their flattened versions in the list, we convert it to an array with the np.asarray function. Note that this requires all images to have the same number of pixels.
The key point here is to save not the array, but the transposed array (final_arrays.T), which puts each image in a column.
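Since the question asks for a separate header per file, here is a minimal sketch of one way to add a header row on top of the column layout described above. The function name save_images_as_columns, the column names img_0 and img_1, and the small synthetic arrays are assumptions made for illustration; they stand in for the real images.

```python
import os
import tempfile

import numpy as np


def save_images_as_columns(images, names, path):
    """Write each flattened image as one CSV column with its own header."""
    # Flatten every image and stack them side by side: (n_pixels, n_images).
    columns = np.column_stack([np.asarray(img).ravel() for img in images])
    # comments="" stops np.savetxt from prefixing the header line with "# ".
    np.savetxt(path, columns, fmt="%d", delimiter=",",
               header=",".join(names), comments="")
    return columns


# Hypothetical stand-ins for two 2x3 grayscale images.
demo_images = [np.arange(6, dtype=np.uint8).reshape(2, 3),
               np.arange(6, 12, dtype=np.uint8).reshape(2, 3)]
demo_path = os.path.join(tempfile.mkdtemp(), "images.csv")
demo_columns = save_images_as_columns(demo_images, ["img_0", "img_1"], demo_path)
```

The header and comments arguments are regular np.savetxt parameters; the rest is just one possible arrangement, and it still assumes that every image has the same size.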

Related

How to adjust text file saved by python?

I'm trying to save data from an array called Sevol. This matrix has 100 rows and 1000 columns, so len(Sevol[i]) is 1000 and Sevol[0][0] is the first element of the first row.
I tried to save this array with the command
np.savetxt(path + '/data_Sevol.txt', Sevol[i], delimiter=" ")
It works fine. However, I would like the file to keep the row-and-column layout of the array. For example, currently, the file is being saved like this in Notepad:
And I would like the data to remain organized, as for example in this file:
Is there an argument in the np.savetxt function or something I can do to better organize the text file?
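A minimal sketch of one possible approach, assuming the goal is one text line per row of Sevol: np.savetxt writes a 1-D input such as Sevol[i] with one value per line, whereas a 2-D input is written one row per line. The array sevol_demo below is a random stand-in for the real Sevol, and the fmt value is only an example.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
sevol_demo = rng.random((100, 1000))  # hypothetical stand-in for Sevol

out_path = os.path.join(tempfile.mkdtemp(), "data_Sevol.txt")
# Passing the whole 2-D array writes 100 lines of 1000 values each.
# A fixed-width format such as "%.6e" keeps the columns aligned.
np.savetxt(out_path, sevol_demo, fmt="%.6e", delimiter=" ")
```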

Is it possible to append data to JSON file?

I am generating some arrays for a simulation and I want to save them in a JSON file. I am using the jsonpickle library.
The problem is that the arrays I need to save can be very large (hundreds of MB up to several GB). Thus, I need to save each array to the JSON file immediately after it is generated.
Currently, I generate multiple independent large arrays, store them in another array, and save them to the JSON file only after all of them have been generated:
N = 1000  # Number of arrays
z_NM = np.zeros((64000, 1), dtype=complex)  # Large array
z_NM_array = np.zeros((N, 64000, 1), dtype=complex)  # Array of z_NM arrays
for i in range(0, N):
    z_NM[:, 0] = GenerateArray()  # Generate array and store it in z_NM
    z_NM_array[i] = z_NM
# Write data to JSON file
data = {"z_NM_array": z_NM_array}
outdata = json.encode(data)  # json is presumably jsonpickle under an alias
with open(filename, "wb+") as f:
    f.write(outdata.encode("utf-8"))
I was wondering if it is instead possible to append the new data to the existing JSON file, by writing each array to the file immediately after its generation, inside the for loop? If so, how? And how can it be read back? Maybe using a library different from jsonpickle?
I know I could save each array in a separate file, but I'm wondering if there's a solution that lets me use a single file. I also have some settings in the dict which I want to save along with the array.
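One way this could be approached is the JSON Lines convention, where each line of the file is its own JSON document, so new records can be appended without rewriting the file. The sketch below uses the standard json module rather than jsonpickle and stores complex values as separate real and imaginary lists; the helper names, the settings dict, and the small arrays standing in for GenerateArray() are all assumptions for illustration.

```python
import json
import os
import tempfile

import numpy as np


def append_record(path, record):
    """Append one JSON document as a single line at the end of the file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def encode_complex(array):
    """Split a complex array into JSON-friendly real and imaginary lists."""
    return {"real": array.real.tolist(), "imag": array.imag.tolist()}


def decode_complex(record):
    """Rebuild the complex array from its real and imaginary lists."""
    return np.array(record["real"]) + 1j * np.array(record["imag"])


def read_records(path):
    """Yield the records one line at a time, without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


jsonl_path = os.path.join(tempfile.mkdtemp(), "arrays.jsonl")
append_record(jsonl_path, {"settings": {"N": 3}})  # hypothetical settings
for i in range(3):
    z = np.full((4, 1), i + 1j * i)  # small stand-in for GenerateArray()
    append_record(jsonl_path, {"index": i, **encode_complex(z)})

records = list(read_records(jsonl_path))
restored = [decode_complex(r) for r in records[1:]]
```

The trade-off is that the file as a whole is no longer a single JSON document; only each individual line is. For arrays of several GB, a text format will also be much larger and slower than a binary one.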

Convert image to numpy array, save it into Excel and reverse the all

I tried my best to solve this issue on my own and I think I've hit a roadblock as it stands so far. I'm new to Python and so far so good on a web scraping project I'm trying to complete. What I'm currently trying to accomplish is to take an image, convert it into something readable by Pandas like a text string, store that into a single Excel cell, and then later convert it back from the text into the image for a finished product in Excel.
I tried a few different methods like base64, which works for conversions to and from the image but exceeds what fits in a single Excel cell. I'm currently trying an approach where I store the image as a NumPy array in the Pandas dataframe and write that to Excel. This seems to work, as it retains the numbers and structure, but I'm having issues reimporting it into NumPy (I'm pretty sure it's an issue of going from integers to strings and trying to go back without really knowing how).
The initial dtype of the image array, right after conversion from image to array, is uint8.
The stored Excel text string of the array has dtype U786 when brought back into NumPy. I've tried converting the string back in NumPy, but I can't figure out how to do it.
A few potential roadblocks:
The image is a screenshot from Selenium that I am saving from my web scraping
I'm writing all my scraped contents, including the screenshot converted to an array, from a Pandas dataframe to Excel in one go.
I'm using Xlsxwriter as my Excel writer to write the dataframe and would like to continue doing so if possible.
Below is an example of the code I'm using for this project. I'm open to any and all potential approaches that would fix this issue.
import numpy as np
from PIL import Image
import openpyxl as Workbook
import pandas as pd
import matplotlib.pyplot as plt  # pyplot must be imported explicitly
#----Opens Image of interest, adds text, and appends to dataframe
MyDataTable = [] #Datatable to write to Excel
ExampleTextString = "Text for Example" #Only used as without it Pandas gives an error of not passing 2D array when saving to excel
MyDataTable.append(ExampleTextString)
img = Image.open('example.png') # uses PIL library to open image in memory
imgtoarray = np.array(img) # imgtoarray.shape: height x width x channel
MyDataTable.append(imgtoarray) #adds my array to dataframe
#----Check my array image with matplotlib
plt.imshow(imgtoarray)
#----Pandas & Excelwriter to Excel
df = pd.DataFrame(MyDataTable)
df.to_excel('ExampleSpreadsheet.xlsx', engine="xlsxwriter", header=False, index=False)
#------Open Array Test Data and where NumPy Array is Saved-----
wb = Workbook.load_workbook(filename='ExampleSpreadsheet.xlsx')
sheet_ranges = wb['Sheet1']
testarraytoimg = sheet_ranges['A2'].value
print(testarraytoimg)
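A minimal sketch of the string round trip on its own, assuming the array is serialized explicitly instead of relying on its printed form: the array is turned into a JSON string together with its dtype and rebuilt from that string afterwards. The helper names array_to_text and text_to_array and the tiny pixels array are made up for illustration. Excel limits a cell to 32,767 characters, so this would only fit for very small images.

```python
import json

import numpy as np


def array_to_text(array):
    """Serialize an array and its dtype into a single JSON string."""
    return json.dumps({"dtype": str(array.dtype), "data": array.tolist()})


def text_to_array(text):
    """Rebuild the array, including its original dtype, from the string."""
    payload = json.loads(text)
    return np.array(payload["data"], dtype=payload["dtype"])


# Hypothetical 2x4 RGB image standing in for the Selenium screenshot.
pixels = np.arange(24, dtype=np.uint8).reshape(2, 4, 3)
cell_text = array_to_text(pixels)          # string that would go into the cell
restored_pixels = text_to_array(cell_text)  # array recovered from the string
```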

Python write multiple numpy array to csv file

I am trying to collect the features of multiple images by converting them from raw data format to .csv.
I have read the features of two images and displayed them with print, but when writing to the CSV file I am only able to add a single NumPy array. I want to add a few thousand images to the same CSV file.
Below is the printed output, but the CSV file only shows one array (the features of a single image).
[Image showing code and output]
I did this using the following code:
with open("output.csv", "a", newline="") as f:  # newline="" as the csv docs recommend
    writer = csv.writer(f)
    writer.writerows(data_read[2])
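A minimal sketch of one way to get every image into the same file, assuming each image's features should occupy one CSV row: the array is flattened and appended with writerow each time. The function name append_features and the small arrays are stand-ins, since the structure of data_read is not shown in the question.

```python
import csv
import os
import tempfile

import numpy as np


def append_features(path, features):
    """Append one flattened feature array as a single CSV row."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(np.asarray(features).ravel().tolist())


features_path = os.path.join(tempfile.mkdtemp(), "output.csv")
for k in range(3):
    # Hypothetical 2x2 feature arrays standing in for the real image data.
    append_features(features_path, np.full((2, 2), k))

with open(features_path, newline="") as f:
    written_rows = list(csv.reader(f))
```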

Pyspark: write df to file with specific name, plot df

I'm working with the latest version of Spark (2.1.1). I read multiple CSV files into a dataframe with spark.read.csv.
After processing this dataframe, how can I save it to an output CSV file with a specific name?
For example, there are 100 input files (in1.csv,in2.csv,in3.csv,...in100.csv).
The rows that belong to in1.csv should be saved as in1-result.csv. The rows that belong to in2.csv should be saved as in2-result.csv and so on. (The default file name will be like part-xxxx-xxxxx which is not readable)
I have seen partitionBy(col), but it looks like it can only partition by column.
My other question is about plotting my dataframe. Spark has no built-in plotting library. Many people use df.toPandas() to convert to pandas and plot from there. Is there any better solution? My data is very big, so toPandas() will cause a memory error. I'm working on a server and want to save the plot as an image instead of showing it.
I suggest the following solution for writing the DataFrame to specific directories related to the input file:
In a loop, for each file:
read the CSV file
add a new column with information about the input file using the withColumn transformation
Then union all DataFrames using the union transformation
do the required preprocessing
save the result using partitionBy, providing the column with the input file information, so that rows related to the same input file are saved in the same output directory
The code could look like this:
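from pyspark.sql.functions import lit

all_df = None
for file in files:  # where files is list of input CSV files that you want to read
    df = spark.read.csv(file)
    # withColumn returns a new DataFrame and expects a Column, hence lit()
    df = df.withColumn("input_file", lit(file))
    if all_df is None:
        all_df = df
    else:
        all_df = all_df.union(df)
result = all_df  # do preprocessing here
# partitionBy takes column names and writes one subdirectory per value
result.write.partitionBy("input_file").csv(outdir)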
