I have large CSV files containing more than 315 million rows and a single column. I have to process more than 50 such files at a time to get the results.
When I read more than 10 of them with a csv reader, it takes more than 12 GB of RAM and is painfully slow. I could read only a chunk of each file to save memory, but then I would spend more time reading, since the whole file gets read again every time.
I have thought about loading them into a database and querying the data from there. However, I am not sure whether this approach would help in any way. Can anyone tell me the most efficient way of handling such scenarios in Python?
You will find the solution here: Lazy Method for Reading Big File in Python?
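As a rough sketch of that lazy approach, a generator yields fixed-size chunks so only one chunk is ever held in memory (the file name, chunk size, and the line counting at the end are placeholders, not your actual processing):

def read_in_chunks(file_object, chunk_size=1024 * 1024):
    # Yield the file piece by piece instead of loading it all at once.
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        yield chunk

total_rows = 0
with open("big_file.csv") as f:   # placeholder path to one of your files
    for chunk in read_in_chunks(f):
        total_rows += chunk.count("\n")
print(total_rows)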
Additionally, if you have a longer processing pipeline, you can look into Section 4.13, "Creating Data Processing Pipelines", in the Python Cookbook, 3rd edition, by Beazley and Jones.
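In the spirit of that recipe, here is a minimal sketch of a generator pipeline that chains all of the files into one lazy stream (the glob pattern and the final row count are assumptions for illustration):

import glob

def gen_lines(paths):
    # Chain every file into a single stream of lines, one file open at a time.
    for path in paths:
        with open(path) as f:
            yield from f

def gen_values(lines):
    # Each file holds a single column, so just strip the newline.
    for line in lines:
        yield line.rstrip("\n")

paths = glob.glob("data/*.csv")        # assumed location of the 50+ files
values = gen_values(gen_lines(paths))
print(sum(1 for _ in values))          # e.g. count all rows without loading them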
Check out ETLyte, a tool I've just open sourced. It's .NET, but you could call out to the EXE from Python. It's still a work in progress but I think it'll work for your situation.
With ETLyte, here would be the steps:
Place the files in the Flatfiles folder, or whichever folder you specify in config.json.
Describe them with a JSON schema and place it in the Schemas folder, or whichever folder you specify. (Note: if they all have the same schema, and you said each file is a single column, then just change the flatfile field in the schema to a regex that matches your files.)
When it comes to performing the addition/multiplication, you could create derived columns that perform that calculation.
Run ETLyteExe.exe and allow the data to flow in
ETLyte is just getting started but it has a lot of features and a lot more on the roadmap. It also comes with an interactive REPL with word completion which wraps the SQLite DLL so you can interrogate the data without installing sqlite3. For an overview of the tool, look here.
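Since the idea is to call out to the EXE from Python, here is a minimal sketch of that call (the install path is an assumption; check the ETLyte documentation for the actual command line and where config.json is expected to live):

import subprocess

# Assumed full path to the executable; adjust to where you unpacked ETLyte.
result = subprocess.run([r"C:\tools\ETLyte\ETLyteExe.exe"], capture_output=True, text=True)
print(result.stdout)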
I have a C++ game that sends a Python-SocketIO request to a server, which loads the requested JSON data into memory for reference and then sends portions of it to the client as necessary. Most of the previous answers here assume that the server has to repeatedly search a database, whereas in this case all of the data is kept in memory after the first load and released when the client disconnects.
I don't want a large spike in memory usage whenever a new client joins, but most of what I have seen points away from using small files (50-100 kB absolute maximum) and toward using large files, which would cause exactly the memory usage I'm trying to avoid.
My question is this: would it still be beneficial to use one large file, or should I use the smaller files, both from an organization standpoint and from a performance one?
Is it better to have one large file or many smaller files for data storage?
Both can potentially be better. Each has its advantages and disadvantages, and which is better depends on the details of the use case. It's quite possible that the best approach is something in between, such as a few medium-sized files.
Regarding performance, the most accurate way to verify what is best is to try out each of them and measure.
If you're only accessing small parts of the data, you should split it into multiple files to keep memory usage down. For example, if you're only accessing, say, a single player, then your folder structure could look like this:
players
- 0.json
- 1.json
other
- 0.json
Then you could write a function that just gets the player with a certain id (0, 1, etc.).
If you're planning on accessing all of the players, other objects, and more at once, then have the same folder structure and just concatenate the parts you need into one object in memory.
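A minimal sketch of such a lookup, assuming the players/<id>.json layout above (the function name is illustrative, not part of any existing code):

import json
import os

def load_player(player_id, base_dir="players"):
    # Only the requested player's file is read into memory.
    path = os.path.join(base_dir, f"{player_id}.json")
    with open(path) as f:
        return json.load(f)

player = load_player(0)   # e.g. reads players/0.json and nothing else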
I have a user-installable application that takes a 2-5 MB JSON file and then queries the data for metrics. It will pull metrics like the number of unique items, or the number of items with a field set to a certain value, etc. Sometimes it pulls metrics that are more tabular, like returning all items with certain properties along with all their fields from the JSON.
I need help making a technology choice. I am deciding between Pandas and SQLite with peewee as an ORM. I am not concerned about converting the JSON file to a SQLite database; I already have this prototyped. I want help evaluating the pros and cons of a SQLite database versus Pandas.
Other factors to consider are that my application may require analyzing metrics across multiple JSON files of the same structure. For example, how many unique items are there across 3 selected JSON files.
I am new to Pandas, so I can't make a strong argument for or against it yet. I am comfortable with SQLite with an ORM, but I don't want to settle if this technology choice would be restrictive for future development. I don't want to factor in a learning curve; I just want an evaluation of the technologies head-to-head for my application.
You are comparing a database to an in-memory processing library. They are two separate ideas. Do you need persistent storage over multiple runs of code? Use SQLite (since you're computing metrics, I would guess this is the path you need). You could use Pandas to write CSVs/TSVs and use those as permanent storage, but you'll eventually hit a bottleneck from having to load multiple CSVs into one DataFrame for processing.
Your use case sounds better suited to using SQLite, in my opinion.
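For a concrete feel of the two options, here is a rough sketch of the same "count unique items" metric in each; the file names, the items key, and the table/column names are assumptions about your JSON structure, not your actual schema:

import json
import sqlite3
import pandas as pd

# Pandas: load one or more JSON files into a single in-memory DataFrame.
paths = ["a.json", "b.json", "c.json"]
frames = []
for p in paths:
    with open(p) as f:
        frames.append(pd.json_normalize(json.load(f)["items"]))
df = pd.concat(frames, ignore_index=True)
unique_items = df["item_id"].nunique()

# SQLite: query the already-converted database instead of holding everything in memory.
conn = sqlite3.connect("metrics.db")
(unique_items_sql,) = conn.execute("SELECT COUNT(DISTINCT item_id) FROM items").fetchone()
conn.close()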
I'm increasingly using Python instead of Perl, but I have one problem: whenever I want to process large files (>1GB) line by line, Python seems to take ages for a job that Perl does in a fraction of the time. However, the general opinion on the web seems to be that Python should be at least as fast for text processing as Perl. So my question is: what am I doing wrong?
Example:
Read a file line by line, split the line at every tab and add the second item to a list.
My python solution would look something like this:
values = []
with open(file_path) as infile:      # file_path holds the path to the input file
    for line in infile:
        ls = line.split("\t")
        values.append(ls[1])         # keep the second tab-separated field
The perl code would look like this:
my @list;
open(my $infile, "<", $file_path);
while (my $line = <$infile>) {
    my @ls = split(/\t/, $line);
    push @list, $ls[1];    # keep the second tab-separated field
}
close($infile);
Is there any way to speed this up?
And to make it clear: I don't want to start the usual "[fill in name of script language A] is sooo much better than [fill in script language B]" thread. I'd like to use Python more, but this is a real problem for my work.
Any suggestions?
Is there any way to speed this up?
Yes, import the CSV into SQLite and process it there. In your case you want .mode tabs instead of .mode csv.
Using any programming language to manipulate a CSV file is going to be slow. CSV is a data transfer format, not a data storage format. They'll always be slow and unwieldy to work with because you're constantly reparsing and reprocessing the CSV file.
Importing it into SQLite will put it into a much, much more efficient data format with indexing. It will take about as much time as Python would, but only has to be done once. It can be processed using SQL, which means less code to write, debug, and maintain.
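As a rough sketch of that one-time import from Python with the standard sqlite3 module (the table and column names are assumptions; the sqlite3 command-line shell with .mode tabs and .import gets you the same result):

import csv
import sqlite3

conn = sqlite3.connect("data.db")                # placeholder database file
conn.execute("CREATE TABLE IF NOT EXISTS rows (col1 TEXT, col2 TEXT)")

with open("big_file.tsv", newline="") as f:      # placeholder input file, two tab-separated columns
    reader = csv.reader(f, delimiter="\t")
    conn.executemany("INSERT INTO rows VALUES (?, ?)", reader)
conn.commit()

# After the import, the work is plain SQL, optionally backed by an index.
conn.execute("CREATE INDEX IF NOT EXISTS idx_col2 ON rows (col2)")
second_items = [row[0] for row in conn.execute("SELECT col2 FROM rows")]
conn.close()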
See Sqlite vs CSV file manipulation performance.
So I was making a simple chat app with Python. I want to store user-specific data in a database, but I'm unfamiliar with efficiency. I want to store usernames, public RSA keys, missed messages, missed group messages, URLs to profile pics, etc.
There are a couple of things in there that would have to be fetched pretty often, like missed messages, profile pics, and a couple of hashes. So here's the question: what database layout would be fastest while staying memory efficient? I want it to be able to handle around 10k users (like that's ever gonna happen).
Here are some options I thought of:
Everything in one file (might be bad for memory, and takes time to load, which matters since I would need to reload it after every change)
Separate files per user (slower, but memory efficient)
Separate files per data value
A directory for each user, with separate files for each value
Thanks, and try to keep it objective so this isn't instantly closed!
The only answer possible at this point is 'try it and see'.
I would start with MySQL (mostly because it's the 'lowest common denominator', freely available everywhere); it should do everything you need up to several thousand users, and if you get that far you should have a far better idea of what you need and where the bottlenecks are.
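As a rough sketch of the kind of schema involved (shown with Python's built-in sqlite3 purely for illustration; the same layout carries over to MySQL, and every name here is an assumption):

import sqlite3

conn = sqlite3.connect("chat.db")   # placeholder database file
conn.executescript("""
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    username TEXT UNIQUE NOT NULL,
    public_key TEXT NOT NULL,            -- public RSA key
    profile_pic_url TEXT
);
CREATE TABLE IF NOT EXISTS missed_messages (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    body TEXT NOT NULL,
    is_group INTEGER NOT NULL DEFAULT 0  -- also covers missed group messages
);
CREATE INDEX IF NOT EXISTS idx_missed_user ON missed_messages (user_id);
""")
conn.close()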
win32com is a general library to access COM objects from Python.
One of the major hallmarks of this library is its ability to manipulate Excel documents.
However, there are lots of specialized modules whose only purpose is to manipulate Excel documents, such as openpyxl, xlrd, xlwt, and python-tablefu.
Are these libraries any better for this specific task? If yes, in what respect?
For instance, opening and writing Excel files directly and efficiently.
win32com uses COM communication, which, while very useful for certain purposes, has to perform complicated API calls that can be very slow (so to speak, you are running code that controls Windows, which in turn controls Excel).
openpyxl and the others just open an Excel file, parse it, and let you work with it.
Try populating an Excel file with 2000 rows of 100 cells each with win32com and then with any of the direct parsers. Where a parser needs seconds, win32com will need minutes.
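A minimal sketch of the openpyxl side of that comparison (the file name and cell contents are arbitrary):

from openpyxl import Workbook

wb = Workbook()
ws = wb.active

# Write 2000 rows of 100 cells each straight to the file; no Excel process is involved.
for row in range(2000):
    ws.append([f"r{row}c{col}" for col in range(100)])

wb.save("benchmark.xlsx")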
Besides, openpyxl (I haven't tried the others) does not require Excel to be installed on the system. It does not even require the OS to be Windows.
Totally different concepts. win32com is a piece of art that opens a door to automate almost anything, while the other option is just a file parser. In other words, to iron your shirt, you use a $20 iron, not a 100 ton metal sheet attached to a Lamborghini Diablo.