I know that I can read a file with numpy's genfromtxt command. It works like this:
data = numpy.genfromtxt('bmrbtmp',unpack=True,names=True,dtype=None)
I can plot the stuff in there easily with:
ax.plot(data['field'],data['field2'], linestyle=" ",color="red")
or
ax.boxplot(data)
and it's awesome. What I would really like to do now is read a whole folder of files and combine them into one giant dataset. How do I add data points to the data structure?
And how do I read a whole folder at once?
To visit all the files in a directory, use os.walk.
To join two structured numpy arrays row-wise, use np.concatenate (genfromtxt with names=True returns 1-D structured arrays, which np.vstack cannot stack when their lengths differ).
To save the result, use np.savetxt to save in a text format, or np.save to save the array in a (smaller) binary format.
import os
import numpy as np

result = None
for root, dirs, files in os.walk('.', topdown=True):
    for filename in files:
        with open(os.path.join(root, filename), 'r') as f:
            data = np.genfromtxt(f, unpack=True, names=True, dtype=None)
        if result is None:
            result = data
        else:
            # 1-D structured arrays are joined row-wise with concatenate
            result = np.concatenate((result, data))
print(result[:10])  # print first 10 rows
np.save('/tmp/outfile.npy', result)
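Since genfromtxt with names=True yields 1-D structured arrays, np.concatenate joins them row-wise even when the files have different numbers of rows. A minimal sketch with two made-up arrays standing in for parsed files:

```python
import numpy as np

# Two tiny structured arrays standing in for two parsed files
dtype = [('field', float), ('field2', float)]
a = np.array([(1.0, 2.0), (3.0, 4.0)], dtype=dtype)
b = np.array([(5.0, 6.0)], dtype=dtype)

combined = np.concatenate((a, b))  # row-wise join of 1-D structured arrays
print(len(combined))       # → 3
print(combined['field2'])  # → [2. 4. 6.]
```

The named columns survive the join, so combined['field'] still works on the merged result.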
I have multiple .npz files of the same structure in a folder, and I want to merge them all into a single .npz file in that folder.
I tried the code below, but it does not combine the multiple .npz files into one.
Here is the code:
import numpy as np

file_list = [r'image-embeddings\img-emb-1.npz', r'image-embeddings\img-emb-2.npz']
data_all = [np.load(fname) for fname in file_list]
merged_data = {}
for data in data_all:
    for k, v in data.items():
        merged_data[k] = v
np.savez('new_file.npz', **merged_data)
where img-emb-1.npz and img-emb-2.npz hold different values.
Maybe try the following to construct merged_data:
arrays_read = dict(
    chain.from_iterable(np.load(file(arr_name)).items() for arr_name in arrays.keys())
)
Full example:
from itertools import chain
import numpy as np

file = lambda name: f"arrays/{name}.npz"

# Create data
arrays = {f"arr{i:02d}": np.random.randn(10, 20) for i in range(10)}

# Save data in separate files
for arr_name, arr in arrays.items():
    np.savez(file(arr_name), **{arr_name: arr})

# Read all files into a dict
arrays_read = dict(
    chain.from_iterable(np.load(file(arr_name)).items() for arr_name in arrays.keys())
)

# Save into a single file
np.savez(file("arrays"), **arrays_read)

# Load to compare
arrays_read_single = dict(np.load(file("arrays")).items())
assert arrays_read.keys() == arrays_read_single.keys()
for k in arrays_read.keys():
    assert np.array_equal(arrays_read[k], arrays_read_single[k])
Still quite new to this and am struggling.
I have a directory of a few hundred text files, each file has thousands of lines of information on it.
Some lines contain one number, some contain many.
example:
39 312.000000 168.871795
100.835446
101.800298
102.414406
104.491999
108.855079
107.384008
103.608815
I need to pull all of the information from each text file. I want the name of the text file (minus the '.txt') in the first column, with all the other information following it to complete the row (regardless of its layout within the file).
import pandas as pd
import os

data = '/path/to/data/'
path = '/other/directory/path/'
lst = ['list of files needed']

for dirpath, dirs, subj in os.walk(data):
    while i <= 5:  # currently being used to break before iterating through entire directory to check it's working
        with open(dirpath + lst[i], 'r') as file:
            info = file.read().replace('\n', '')  # txt file onto one line
            corpus.append(lst[i] + ' ')  # begin list with txt file name
            corpus.append(info)  # add file contents to list after file name
            output = ''.join(corpus)  # get out of list format
            output.split()
            i += 1
            df = pd.read_table(output, lineterminator=',')
            df.to_csv(path + 'testing.csv')
        if i > 5:
            break
Currently, this prints Errno 2 (no such file or directory), then prints the contents of the first file and no others, and does not save to csv.
This also seems horribly convoluted, and I'm sure there's a better way of doing it.
I also suspect the lineterminator will not force each new text file onto a new row, so any suggestions there would be appreciated.
desired output:
file1 39 312.000 168.871
file2 72 317.212 173.526
You are loading os and pandas so you can take advantage of their functionality (listdir, path, DataFrame, concat, and to_csv) and drastically reduce your code's complexity.
import os
import pandas as pd

data = 'data/'
path = 'output/'

files = os.listdir(data)
output = pd.DataFrame()
for file in files:
    file_name = os.path.splitext(file)[0]
    with open(os.path.join(data, file)) as f:
        info = [float(x) for x in f.read().split()]
        #print(info)
    df = pd.DataFrame(info, columns=[file_name], index=range(len(info)))
    output = pd.concat([output, df], axis=1)
output = output.T
print(output)
# keep the index so the file names land in the first column
output.to_csv(path + 'testing.csv')
I would double-check that your data folder has only .txt files, and maybe add a check for .txt files to the code.
This got less elegant as I learned more about the requirements. If you want to flip the columns and rows, just remove the output.T line; it transposes the dataframe.
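A minimal sketch of such a check, assuming a similar folder layout (the folder and file contents here are made up), using pathlib.Path.glob so that only .txt files are read:

```python
import tempfile
from pathlib import Path

# Stand-in data folder with mixed file types
data = Path(tempfile.mkdtemp())
(data / 'subj1.txt').write_text('1.0 2.0')
(data / 'notes.csv').write_text('a,b')

txt_files = sorted(data.glob('*.txt'))  # keep only .txt files, in a stable order
names = [f.stem for f in txt_files]     # file names without the .txt extension
print(names)  # → ['subj1']
```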
I am currently working with 9 different csv files, all testing similar samples of a material. The output data looks similar to this:
Time,Displacement,Force,Flexure stress,Flexure strain (Displacement)
(s),(mm),(N),(MPa),(%)
"0.0000","0.0000","0.0007","0.0000","0.0000"
"0.0200","0.0000","0.0069","0.0004","0.0000"
"0.0400","0.0001","-0.0024","-0.0001","0.0003"
"0.0600","0.0005","0.0040","0.0002","0.0014"
"0.0800","0.0014","0.0106","0.0006","0.0041"
I was able to plot each file on the same plot using this code I have put together from several sources:
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
### Set path to the folder containing the .csv files
PATH = 'my path'
### Fetch all files in path
fileNames = os.listdir(PATH)
### Filter file name list for files ending with .csv
fileNames = [file for file in fileNames if '.csv' in file]
### Loop over all files
for file in fileNames:
    ### Read .csv file
    df = pd.read_csv(PATH + file, usecols=[3, 4], skiprows=2, names=['Stress', 'Strain'], header=None)
    strain_df = df['Strain'] * 0.01
    side = 6.2  # mm
    stress_df = df['Stress'] / side**2  # N/mm**2
    ### Create line for every file
    plt.plot(strain_df, stress_df)
### Generate the plot
plt.xlabel(r'Strain $\epsilon$ (mm/mm)')
plt.ylabel(r'Stress $\sigma$ (N/mm$^2$)')
plt.title(r'Stress Strain Curve - 4$^\circ$C/min ')
plt.show()
This code block then gives a plot like this:
I am fine with how that ended up, but I would really like to add an average line to this plot, both for this material and to compare against other materials tested the same way, instead of having 40+ lines on one plot. I'm not sure whether the best way is to create a new csv file holding the average of each row and column, or whether I can compute the average inside the loop I created. Any tips would be greatly appreciated!
This is something similar to what I am looking for (the black line in the bottom plot):
Here is my solution and image result.
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
### Set path to the folder containing the .csv files
PATH = './'
### Fetch all files in path
fileNames = os.listdir(PATH)
### Filter file name list for files ending with .csv
fileNames = [file for file in fileNames if '.csv' in file]
count_fileNames = len(fileNames)
list_strain = []
list_stress = []
### Loop over all files
for file in fileNames:
    print("file: ", file)
    ### Read .csv file
    df = pd.read_csv(PATH + file, usecols=[3, 4], skiprows=2, names=['Stress', 'Strain'], header=None, encoding='unicode_escape')
    strain_df = df['Strain'] * 0.01
    side = 6.2  # mm
    stress_df = df['Stress'] / side**2  # N/mm**2
    list_strain.append(strain_df)
    list_stress.append(stress_df)
    ### Create line for every file
    plt.plot(strain_df, stress_df)

### Element-wise average of the per-file series (assumes every file has the
### same number of rows, since pandas aligns on the index when summing)
mean_strain = sum(list_strain) / count_fileNames
mean_stress = sum(list_stress) / count_fileNames
plt.plot(mean_strain, mean_stress, 'k')
### Generate the plot
plt.xlabel(r'Strain $\epsilon$ (mm/mm)')
plt.ylabel(r'Stress $\sigma$ (N/mm$^2$)')
plt.title(r'Stress Strain Curve - 4$^\circ$C/min ')
plt.show()
result
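If the files do not all have the same number of rows, the element-wise sum in the loop above will not line up. One option (a sketch with made-up linear curves, not the poster's data) is to interpolate every curve onto a common strain grid with np.interp before averaging:

```python
import numpy as np

# Hypothetical (strain, stress) curves with different numbers of points
curves = [
    (np.linspace(0, 0.05, 60), np.linspace(0, 30, 60)),
    (np.linspace(0, 0.05, 80), np.linspace(0, 32, 80)),
]

# Common strain grid covering the shared range
grid = np.linspace(0, 0.05, 100)

# Interpolate each stress curve onto the grid, then average point-wise
mean_stress = np.mean(
    [np.interp(grid, strain, stress) for strain, stress in curves], axis=0
)
print(mean_stress[-1])  # → 31.0 (average of the two endpoint stresses)
```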
I'm extremely new to Python and trying to figure out the below:
I have multiple CSV files (monthly files) that I’m trying to combine into a yearly file. The monthly files all have headers, so I’m trying to keep the first header & remove the rest. I used the below script which accomplished this, however there are 10 blank rows between each month.
Does anyone know what I can add to this to remove the blank rows?
import shutil
import glob

# import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
        print(fname + " has been imported.")
Thank you in advance!
Assuming the dataset isn't bigger than your memory, I suggest reading each file with pandas, concatenating the dataframes, and filtering from there. Blank rows will probably show up as NaN.
import pandas as pd
import glob

path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()

# read every file and concatenate into one dataframe
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
frames = [pd.read_csv(fname) for fname in allFiles]
df = pd.concat(frames, ignore_index=True)

# drop rows that are entirely blank
df = df.dropna(how='all')

# write to file
df.to_csv('someoutputfile.csv', index=False)
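As a quick illustration of the dropna step with a toy frame (not the poster's data): a row whose cells are all NaN is removed, while partially filled rows survive:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, np.nan, np.nan]})
cleaned = df.dropna(how='all')  # drops only the middle, fully blank row
print(len(cleaned))  # → 2
```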
There is a folder which contains a lot of data; for example, it contains ".html" files, ".jpeg" files, ".pdf" files, and plenty of ".csv" files with different names. Here is code that lists only the csv files.
Is there a way so that, when I list all ".csv" files, I can enter the corresponding code to plot a graph?
import os

path = "F:\\Users\\Desktop\\Data\\Summary"
files = []
# r=root, d=directories, f=files
for r, d, f in os.walk(path):
    for file in f:
        if '.csv' in file:
            files.append(os.path.join(r, file))
for f in files:
    print(f)
When I run the above code I get output like:
F:\Users\Desktop\Data\Summary\Test_Summary_1.csv
F:\Users\Desktop\Data\Summary\Test_Summary_2.csv
F:\Users\Desktop\Data\Summary\Test_Summary_3.csv
Actually I want the output to be displayed as:
0-Test_Summary_1.csv
1-Test_Summary_2.csv
2-Test_Summary_3.csv
3-Test_Summary_4.csv
4-Test_Summary_5.csv
5-Test_Summary_6.csv etc
How do I modify it to get the output shown above?
If you run into trouble that the files aren't listed in the right order you can just sort the list of filenames as follows:
>>> x = ['abc_1.csv', 'abc_2.csv', 'abc_0.csv']
>>> x.sort()
>>> x
['abc_0.csv', 'abc_1.csv', 'abc_2.csv']
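To get the numbered listing asked for, one option (a sketch with hypothetical paths; forward slashes are used so os.path.basename splits them on any platform) is enumerate plus os.path.basename:

```python
import os

# Hypothetical list of collected csv paths
files = ['F:/Users/Desktop/Data/Summary/Test_Summary_1.csv',
         'F:/Users/Desktop/Data/Summary/Test_Summary_2.csv']
listing = [f"{i}-{os.path.basename(f)}" for i, f in enumerate(files)]
for line in listing:
    print(line)  # → 0-Test_Summary_1.csv, then 1-Test_Summary_2.csv
```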
If you know which csv file you'd like to plot, you can read it into a numpy array as follows:
from numpy import loadtxt
data = loadtxt(filename, delimiter=',')
Then you can just plot the data using matplotlib
import matplotlib.pyplot as plt
plt.plot(data[:,0], data[:,1], 'ro')
plt.show()