I have data in an array (B=[1,2,3,4,5]) which I took from the DataTable. I tried a Python for loop to import it into an Excel file using this code:
def Cells(a, b):
    # convert (row, column) numbers into an A1-style reference, e.g. (1, 1) -> "a1"
    return str(chr(b + 96) + str(a))

import clr
clr.AddReference("Microsoft.Office.Interop.Excel")
import Microsoft.Office.Interop.Excel as Excel

ex = Excel.ApplicationClass()
ex.Visible = True
workbook = ex.Workbooks.Open(r"F:\Programming\Excel\Plot data.xlsx")
worksheet = workbook.Worksheets("Sheet1")

Adding_Max_Principal_Stress = Model.Analyses[0].Solution.AddMaximumPrincipalStress()
Model.Analyses[0].Solution.EvaluateAllResults()
A = Adding_Max_Principal_Stress.PlotData
B = A.Values[1]
C = B.Count

# write the values one cell at a time
for i in range(C):
    E = B[i]
    worksheet.Range(Cells(1 + i, 1)).Value = E
In this code, B contains the list of data like B=[1,2,3,4,5,...]. B holds about 100,000 values, and importing them into Excel this way takes around 5 hours. Is there any way I can speed up this process?
I'm guessing IronPython doesn't support many Python libraries? In which case, the code needs to be almost pure Python?
Have you used a line profiler to see what the slowest part of the code is?
It seems like in your .Range call you are only setting one cell at a time; have you tried setting all of the values at once? Every time the code has to interact with Excel it is very slow, so if you can set all of the cell values in a single call it will be much faster. I know how to do this in VBA code.
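For example, something along these lines writes the whole column in a single COM call instead of 100,000 separate ones. This is a rough sketch that reuses the workbook, worksheet, B, C and Cells objects from the question; depending on how the interop binds, you may need worksheet.Range[...] or worksheet.get_Range(...) instead of the call shown, so treat the exact syntax as an assumption to verify:

from System import Array, Object

# build a C x 1 .NET array so Excel sees it as a column of values
data = Array.CreateInstance(Object, C, 1)
for i in range(C):
    data[i, 0] = B[i]

# one assignment to the whole range replaces C single-cell writes
target = worksheet.Range(Cells(1, 1), Cells(C, 1))
target.Value2 = data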
Do you have to save as an excel file? Can you save as a CSV file instead? Creating large excel files is slow.
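If CSV is acceptable, the whole array can be dumped without touching Excel at all. A minimal sketch, assuming B is the value list from the question and the output path is just a placeholder (binary mode because IronPython is Python 2):

import csv

with open(r"F:\Programming\Excel\plot_data.csv", "wb") as f:
    writer = csv.writer(f)
    for value in B:
        writer.writerow([value])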
Related
I'm currently using the following line to read Excel files
df = pd.read_excel(f"myfile.xlsx")
The problem is the enormous slowdown that occurs when I use data from this Excel file, for example in function calls. I think this happens because I'm not reading the file via a context manager. Is there a way of combining a 'with' statement with the pandas 'read' command so the code runs more smoothly? Sorry that this is vague, I'm just learning about context managers.
Edit : Here is an example of a piece of code that does not run...
import pandas as pd
import numpy as np

def fetch_excel(x):
    df_x = pd.read_excel(f"D00{x}_balance.xlsx")
    return df_x

T = np.zeros(3000)
for i in range(0, 3000):
    T[i] = fetch_excel(1).iloc[i + 18, 0]

print(fetch_excel(1).iloc[0, 0])
...or it takes more than 5 minutes which seems exceptional to me. Anyway I can't work with a delay like that. If I comment out the for loop, this does work.
Usually the key reason to use standard context managers for reading in files is convenience of closing and opening the underlying file descriptor. You can create context managers to do anything you'd like, though. They're just functions.
Unfortunately they aren't likely to solve the problem of slow loading times reading in your excel file.
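For completeness, pandas does let you use a with statement via pd.ExcelFile, but it only manages the underlying file handle; it won't make the parsing itself any faster. A small sketch, assuming the same file name as in the question:

import pandas as pd

with pd.ExcelFile("D001_balance.xlsx") as xls:
    df_x = pd.read_excel(xls)   # the handle is closed when the block exits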
You are accessing the HDD, opening, reading and converting the SAME file D001_balance.xlsx 3000 times to access a single piece of data - different row each time from 18 to 3017. This is pointless as the data is all in the DataFrame after one reading. Just use:
df_x = pd.read_excel("D001_balance.xlsx")
T = np.zeros(3000)
for i in range(0, 3000):
    T[i] = df_x.iloc[i + 18, 0]

print(df_x.iloc[0, 0])
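Since the 3000 values are one contiguous block of a single column, you could also drop the loop entirely and take a slice (a small variation on the code above):

T = df_x.iloc[18:3018, 0].to_numpy()   # use .values on older pandas versions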
I am trying to overwrite a value in a given cell using openpyxl. I have two sheets: one called Raw, which is populated by API calls, and a second called Data, which is fed off the Raw sheet. The two sheets have exactly the same shape (columns/rows). I compare the two to see if there is a bay assignment in Raw. If there is, I grab it into the Data sheet. If the value in that column is missing in both Raw and Data, I run a complex algorithm (irrelevant for this question) to assign a bay number based on logic.
I am having problems with rewriting Excel using openpyxl.
Here's an example of my code.
import pandas as pd
from openpyxl import load_workbook

data_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayData')
raw_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayRaw')

# grab rows where there is no bay assignment in a specific column
no_bay_res = data_df[data_df['Bay assignment'].isnull()].reset_index()

book = load_workbook("Algo Build v23test.xlsx")
sheet = book["MondayData"]

for index, reservation in no_bay_res.iterrows():
    idx = int(reservation['index'])
    if pd.isna(raw_df.iloc[idx, 13]):
        continue
    value = raw_df.iat[idx, 13]
    data_df.iloc[idx, 13] = value
    sheet.cell(idx + 2, 14).value = int(value)

book.save("Algo Build v23test.xlsx")
book.close()
print(value)  # 302
Now the problem is that book.close() does not seem to be working; the book object is still usable in Python. The script overwrites the Excel file totally fine. However, if I try to run these two lines again
data_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayData')
raw_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayRaw')
I am getting datasets full of NULL values, except for the value that was replaced. (attached the image).
However, if I open that Excel file manually from the folder and save it (CTRL+S) and try running the code again - it works properly. Weirdest problem.
I need to loop the code above for Monday-Sunday, so I need it to be able to read the data again without manually resaving the file.
For some reason, pandas reads all the formulas as NaN after the file has been touched by openpyxl in the script, until the file has been opened, saved and closed again in Excel. Here's the code that does that from within the script; however, it is rather slow.
import xlwings as xl
import pandas as pd

def df_from_excel(path, sheet_name):
    # open, save and close the file in a hidden Excel instance so the
    # cached formula results are restored before pandas reads the file
    app = xl.App(visible=False)
    book = app.books.open(path)
    book.save()
    app.kill()
    return pd.read_excel(path, sheet_name)
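It is then called in place of the plain pd.read_excel, for example with the same file and sheet names as in the question:

data_df = df_from_excel('Algo Build v23test.xlsx', 'MondayData')
raw_df = df_from_excel('Algo Build v23test.xlsx', 'MondayRaw')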
I got the same problem; the only workaround I found was to terminate excel.exe manually from Task Manager. After that everything went fine.
I'm currently working with functional MRI data in R but I need to import it to Python for some faster analysis. How can I do that in an efficient way?
I currently have, in R, a list of 198135 dataframes. All of them have 5 variables and 84 observations of connectivity between brain regions. I need the same 198135 dataframes in Python to run some specific analysis there (with the same structure as in R: one object that contains all the dataframes separately).
Initially I tried exporting a .RDS file from R and then importing it into Python using "pyreadr", but I got empty objects in every attempt with the "pyreadr.read_r()" function.
My other method was to save every dataframe of the R list as a separate .csv file and then import them into Python. That way I could get what I wanted (I tried it with only 100 dataframes to test the code). The problem with this method is that it is highly inefficient and slow.
I found several answers to similar problems, but most of them involved merging all the dataframes and loading them as a single .csv into Python, which is not the solution I need.
Is there some more efficient way to do this process, without altering the data structure that I mentioned?
Thanks for your help!
# This is the code in R for an example
a <- as.data.frame(cbind(c(1:3), c(1:3), c(4:6), c(7:9)))
b <- as.data.frame(cbind(c(11:13), c(21:23), c(64:66), c(77:79)))
c <- as.data.frame(cbind(c(31:33), c(61:63), c(34:36), c(57:59)))
d <- as.data.frame(cbind(c(12:14), c(13:15), c(54:56), c(67:69)))
e <- as.data.frame(cbind(c(31:33), c(51:53), c(54:56), c(37:39)))
somelist_of_df <- list(a,b,c,d,e)
saveRDS(somelist_of_df, "somefile.rds")
## This is the function I used from pyreadr in Python
import pyreadr
results = pyreadr.read_r('/somepath/somefile.rds')
Well, thanks for the help in the other answers, but it's not exactly what I was looking for (I wanted to export just one file with the list of dataframes in it, and then load one single file into Python, keeping the same structure). To use feather you have to decompose the list into the separate dataframes within it, pretty much like saving separate .csv files, and then load each one of them into Python (or R). Anyway, it must be said that it's much faster than the .csv method.
I'll leave the code that I used successfully here as a separate answer; maybe it will be useful for other people, since I used a simple loop for loading the dataframes into Python as a list:
## Exporting a list of dataframes from R to .feather files
library(feather) #required package
a <- as.data.frame(cbind(c(1:3), c(1:3), c(4:6), c(7:9))) #Example DFs
b <- as.data.frame(cbind(c(11:13), c(21:23), c(64:66), c(77:79)))
c <- as.data.frame(cbind(c(31:33), c(61:63), c(34:36), c(57:59)))
d <- as.data.frame(cbind(c(12:14), c(13:15), c(54:56), c(67:69)))
e <- as.data.frame(cbind(c(31:33), c(51:53), c(54:56), c(37:39)))
somelist_of_df <- list(a,b,c,d,e)
## With sapply you loop over the list for creating the .feather files
sapply(seq_along(1:length(somelist_of_df)),
       function(i) write_feather(somelist_of_df[[i]],
                                 paste0("/your/directory/", "DF", i, ".feather")))
(Using just a MacBook Air, the code above took less than 5 seconds to run for a list of 198135 DFs)
## Importing .feather files into a list of DFs in Python
import os
import feather

os.chdir('/your/directory')
directory = '/your/directory'
py_list_of_DFs = []

for filename in os.listdir(directory):
    DF = feather.read_dataframe(filename)
    py_list_of_DFs.append(DF)
(This code did the job for me, although it was a bit slow: it took 12 minutes to do the task for the 198135 DFs.)
I hope this could be useful for somebody.
This package may be of some interest to you
Pandas also implements a direct way to read .feather files:
pd.read_feather()
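So the import loop above can be written with pandas alone, without the standalone feather package. A sketch assuming the same directory layout as in the answer:

import os
import pandas as pd

directory = '/your/directory'
py_list_of_DFs = []
for filename in os.listdir(directory):
    py_list_of_DFs.append(pd.read_feather(os.path.join(directory, filename)))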
Pyreadr cannot currently read R lists, therefore you need to save the dataframes individually. You also need to save to an RDA file so that you can host multiple dataframes in one file:
# first construct a list with the names of dataframes you want to save
# instead of the dataframes themselves
somelist_of_df <- list("a", "b", "c", "d", "e")
do.call("save", c(somelist_of_df, file="somefile.rda"))
or any other variant as described here.
Then you can read the file in python:
import pyreadr
results = pyreadr.read_r('/somepath/somefile.rda')
The advantage is that there will be only one file with all dataframes.
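read_r returns a dict-like object keyed by the R object names, so the list structure can be rebuilt on the Python side, for example:

# results behaves like an ordered dict: {"a": DataFrame, "b": DataFrame, ...}
list_of_dfs = [results[name] for name in results.keys()]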
I cannot comment on crlagos0's answer because of reputation. I want to add a couple of things:
seq_along(list_of_things) is enough; there is no need to do seq_along(1:length(list_of_things)) in R. Also, I want to point out that the official package to read and write feather files in R is called arrow, and you can find its documentation here. In Python it is pyarrow.
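With pyarrow the read side looks essentially the same as with the feather package; a minimal sketch (pyarrow.feather.read_feather returns a pandas DataFrame):

from pyarrow import feather

df1 = feather.read_feather('/your/directory/DF1.feather')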
I have hundreds of thousands of data text files to read. As of now, I'm importing the data from the text files every time I run the code. Perhaps the easy solution would be to simply reformat the data into a file that is faster to read.
Anyway, right now every text file I have looks like this:
User: unknown
Title : OE1_CHANNEL1_20181204_103805_01
Sample data
Wavelength OE1_CHANNEL1
185.000000 27.291955
186.000000 27.000877
187.000000 25.792290
188.000000 25.205620
189.000000 24.711882
.
.
.
The code where I read and import the txt files is:
import os
import sys
import numpy as np

# IMPORT DATA
path = 'T2'
if len(sys.argv) == 2:
    path = sys.argv[1]

files = os.listdir(path)
trans_import = []
for index, item in enumerate(files):
    # note: the original used files[1] here, which re-reads the same file; item is the current file
    trans_import.append(np.loadtxt(path + '/' + item, dtype=float,
                                   skiprows=4, usecols=(0, 1)))
The resulting array looks like this in the variable explorer:
{ndarray} = [[185. 27.291955]\n [186. 27.000877]\n ... ]
I'm wondering how I could speed up this part? It takes a little too long as of now just to import ~4k text files. There are 841 lines inside every text file (one spectrum per file). The output I get with this code is 841 * 2 = 1682 lines. Obviously, it considers the \n as a line...
It would probably be much faster if you had one large file instead of many small ones. This is generally more efficient. Additionally, you might get a speedup from just saving the numpy array directly and loading that .npy file in instead of reading in a large text file. I'm not as sure about the last part though. As always when time is a concern, I would try both of these options and then measure the performance improvement.
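A rough sketch of the save-once idea, assuming every file parses to the same shape and reusing the trans_import list from the question (the .npy file name is just a placeholder):

import numpy as np

# one-off conversion: stack all spectra into a single 3-D array and save it
all_data = np.stack(trans_import)        # shape: (n_files, 841, 2)
np.save('all_spectra.npy', all_data)

# later runs: loading the binary file is near-instant compared with re-parsing text
all_data = np.load('all_spectra.npy')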
If for some reason you really can't just have one large text file / .npy file, you could also probably get a speedup by using, e.g., multiprocessing to have multiple workers reading in the files at the same time. Then you can just concatenate the matrices together at the end.
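And a minimal sketch of the multiprocessing idea, assuming the same path and files variables as in the question (the worker count is arbitrary):

import os
import numpy as np
from multiprocessing import Pool

def load_one(filename):
    # parse a single spectrum file, skipping the 4 header lines
    return np.loadtxt(os.path.join(path, filename), dtype=float,
                      skiprows=4, usecols=(0, 1))

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        trans_import = pool.map(load_one, files)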
Not your primary question but since it seems to be an issue - you can rewrite the text files to not have those extra newlines, but I don't think np.loadtxt can ignore them. If you're open to using pandas, though, pandas.read_csv with skip_blank_lines=True should handle that for you. To get a numpy.ndarray from a pandas.DataFrame, just do dataframe.values.
Use pandas.read_csv (with its C engine) instead of numpy.loadtxt. This is a very helpful post:
http://akuederle.com/stop-using-numpy-loadtxt
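Adapted to the file format above, that could look something like the following sketch; the column names and the skiprows value are assumptions based on the sample file shown:

import pandas as pd

def load_spectrum(filepath):
    # whitespace-separated values, 4 header lines, blank lines dropped automatically
    df = pd.read_csv(filepath, delim_whitespace=True, skiprows=4,
                     header=None, names=['wavelength', 'intensity'],
                     skip_blank_lines=True)
    return df.values   # plain numpy array, like np.loadtxt returns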
I have a stack of CT-scan images. After processing one image from that stack using Matlab, I saved the XY coordinates for each different boundary region in a separate Excel sheet as follows:
I = imread('myCTscan.jpeg');
BW = im2bw(I);
[coords, labeledImg] = bwboundaries(BW, 4, 'holes');
sheet = 1;
for n = 1:length(coords)
    xlswrite('fig.xlsx', coords{n,1}, sheet, 'A1');
    sheet = sheet + 1;
end
The next step is then to import this set of coordinates and plot it into Abaqus CAE Sketch for finite element analysis.
I figured out that my workflow is something like this:
1. Import the Excel workbook
2. For each sheet in the workbook:
2.1. For each row: read both columns to get the xy coordinates (each row has two columns, the x and the y coordinate)
2.2. Put each xy coordinate pair inside a list
2.3. From the list, sketch using the spline method
3. Repeat step 2 for the other sheets within the workbook
I searched for a while and found something like this:
from abaqus import *

lines = open('fig.xlsx', 'r').readlines()
pointList = []
for line in lines:
    pointList.append(eval('(%s)' % line.strip()))

s1 = mdb.models['Model-1'].ConstrainedSketch(name='mySketch', sheetSize=500.0)
s1.Spline(points=pointList)
But this only reads XY coordinates from one sheet, and I'm stuck at step 3 above. So my problem is: how do I read these coordinates from the different sheets using an Abaqus/Python (Abaqus 6.14, Python 2.7) script?
I'm new to Python programming; I can read and understand the syntax but can't write very well (I'm still struggling with how to import a Python module in Abaqus). Manually typing each coordinate (like in Abaqus' modelAExample.py tutorial) is practically impossible since each of my CT-scan images can have 100+ boundary regions and 10k+ points.
I'm using:
Windows 7 x64
Abaqus 6.14 (with built in Python 2.7)
Excel 2013
Matlab 2016a with Image Processing Toolbox
You are attempting to read an Excel file as a comma-separated file. CSV files by definition cannot have more than one sheet (tab). Your read command is interpreting the file as a CSV and not letting you iterate over the sheets in your file (though it begs the question how your file is opening properly in the first place, as you are saving an xlsx and reading it as a csv).
There are numerous python libraries that will parse and process XLS/XLSX files.
Take a look at openpyxl and use it to read your file in.
You would likely use something like
from openpyxl import load_workbook

wb = load_workbook('fig.xlsx')       # open the workbook
for name in wb.sheetnames:
    ws = wb[name]                    # worksheet object for the current sheet
and then input your remaining commands.
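Putting the pieces together, the loop over sheets might look roughly like the sketch below. It is only a sketch: it assumes openpyxl can be imported from Abaqus' built-in Python 2.7, that each sheet holds x in column A and y in column B, and it reuses the ConstrainedSketch/Spline calls from the question (the sketch names are made up):

from abaqus import *
from openpyxl import load_workbook

wb = load_workbook('fig.xlsx', data_only=True)

for name in wb.sheetnames:
    ws = wb[name]
    pointList = []
    for row in ws.iter_rows(min_col=1, max_col=2):
        x, y = row[0].value, row[1].value
        if x is None or y is None:
            continue                      # skip empty rows
        pointList.append((float(x), float(y)))
    # one sketch (and one spline) per boundary region, i.e. per sheet
    s = mdb.models['Model-1'].ConstrainedSketch(name='boundary_' + name,
                                                sheetSize=500.0)
    s.Spline(points=pointList)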