I'm trying to process a 1.8 MB txt file. There are a couple of header lines; after that it's all space-separated data. I can pull the data in using pandas. What I want to do with the data is:
1) Cut out the non-essential data, i.e. roughly the first 1675 lines and the last 3-10 lines (the exact number varies day to day). I can remove the first lines, sort of. The big problem I'm having right now is knowing for sure where that 1675 'pointer' location is. Using something like
df = df[df.year > 1978]
only moves the initial 'pointer' to 1675. If I try
dataf = df[df.year > 1978]
it just gives me a copy of what I would have with the first line. It still keeps the pointer at the same 1675 start point; it won't let me access any of the first 1675 rows, but they are obviously still there. If I try
df.year[0]
it comes back with an error suggesting row 0 doesn't exist, and I have to go searching for what the first readable row actually is. Instead of flat-out removing the rows and moving the new pointer up to 0, it just moves the pointer to 1675 and won't allow access to anything lower than that. I also haven't found a way to determine the last row number programmatically; through the shell it's easy, but I need to do it in the program so I can set up the loop for point 2.
2) I want to take averages of the data ('x'-day moving averages) and create a new column with the results once I have calculated the moving average. I think I can create the new column with a Series statement, but I haven't tried it yet because I haven't been able to get this far.
3) After all this and some more math, I want to graph the data with a homemade graph. This should be easy once everything else is done; I have already created the sample graph and can plot the points/lines once I have the data to work with.
Is pandas the right library for this project, or should I be trying to use something else? So far, the more research I do the more lost I get, as everything I try gets me a little further but sets me back at the same time. In a similar question I saw someone mention using something else for doing math on the data block, but there was no indication of what that was.
It sounds like your main trouble is indexing. If you want to refer to the "first" thing in a DataFrame, use df.iloc[0]. DataFrame indexes are really powerful in general; see the indexing docs:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
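For example, a minimal sketch of the difference (assuming the frame has a year column, as in your snippets):

import pandas as pd

# hypothetical frame standing in for your parsed file
df = pd.DataFrame({"year": range(1970, 1990), "value": range(20)})

subset = df[df.year > 1978]

# the filter keeps the original row labels (9, 10, ...), which is why
# subset.year[0] raises an error: there is no label 0 any more
subset.iloc[0]               # positional: always the first remaining row
subset.index[0]              # the label of that first row
last_row = len(subset) - 1   # position of the last row, for your loop

# to renumber the rows from 0 again, drop the old index
subset = subset.reset_index(drop=True)
subset.year[0]               # now works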
I think you are headed in the right direction. Pandas gives you nice, high level control over your data so that you can manipulate it much more easily than using traditional logic. It will take some work to learn. Work through their tutorials and you should be fine. But don't gloss over them or you'll miss some important details.
I'm not sure why you are concerned that the lines you want to ignore aren't being deleted; as long as they aren't used in your analysis, it doesn't really matter. Unless you are facing memory constraints, it's probably irrelevant. But if you do find you can't afford to keep them around, I'm sure there is a way to really remove them, even if it's a bit sideways.
Processing a few megabytes worth of data is pretty easy these days and Pandas will handle it without any problems. I believe you can easily pass pandas data to numpy for your statistical calculations. You should double check that, though, before taking my word for it. Also, they mention matplotlib on the pandas website, so I am guessing it will be easy to do basic graphing as well.
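As a rough sketch of points 1-3, something like the following may be close to what you need (the file name, column names, the skiprows/skipfooter counts, and the window size are all placeholders for whatever your file actually contains):

import pandas as pd
import matplotlib.pyplot as plt

# skip the header block at the top and the trailing lines at the bottom;
# skipfooter requires the python engine, and the counts here are examples only
df = pd.read_csv("data.txt",
                 sep=r"\s+",               # space-separated data
                 skiprows=1675,
                 skipfooter=10,
                 engine="python",
                 header=None,
                 names=["year", "value"])  # placeholder column names

# x-day moving average as a new column (here x = 5)
df["moving_avg"] = df["value"].rolling(window=5).mean()

# basic plot of the raw data and the moving average
plt.plot(df.index, df["value"], label="raw")
plt.plot(df.index, df["moving_avg"], label="5-period moving average")
plt.legend()
plt.show()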
I'm currently trying to learn more about deep learning / CNNs / Keras through what I thought would be quite a simple project: just training a CNN to detect a single specific sound. It's been a lot more of a headache than I expected.
I'm currently reading through this (ignoring the second section about GPU usage); the first part definitely seems like exactly what I need. But when I go to run the script (mine is pretty much lifted wholesale from the section in the link above that says "Putting the pieces together, you may end up with something like this:"), it gives me this error:
AttributeError: 'DataFrame' object has no attribute 'file_path'
I can't find anything in the pandas documentation about a DataFrame.file_path function. So I'm confused as to what that part of the code is attempting to do.
My CSV file contains two columns, one with the paths and then a second column denoting the file paths as either positive or negative.
Side note: I'm also aware that this entire guide may just not be the thing I'm looking for. I'm having a very hard time finding material that is useful for the specific project I'm trying to do, and if anyone has links that would be better, I'd very much appreciate it.
The statement df.file_path means that you want to access the file_path column in your dataframe. It seems that your dataframe does not contain this column. With df.head() you can check whether your dataframe contains the needed fields.
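For example (a quick sketch; the actual column names in your CSV are a guess here):

import pandas as pd

df = pd.read_csv("labels.csv")
print(df.head())      # look at the first few rows
print(df.columns)     # list the column names that actually exist

# if the guide expects a column called "file_path" but your CSV uses a
# different header (say "path"), rename it so df.file_path works
df = df.rename(columns={"path": "file_path"})

# or, if your CSV has no header row at all, supply the names yourself:
# df = pd.read_csv("labels.csv", header=None, names=["file_path", "label"])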
So I'm a newbie when it comes to working with big data.
I'm dealing with a 60GB CSV file so I decided to give Dask a try since it produces pandas dataframes. This may be a silly question but bear with me, I just need a little push in the right direction...
So I understand why the following query using the "compute" method would be slow (lazy computation):
df.loc[1:5 ,'enaging_user_following_count'].compute()
By the way, it took 10 minutes to compute.
But what I don't understand is why the following code using the "head" method prints the output in less than two seconds (in this case I'm asking for 250 rows, while the previous snippet asked for just 5):
df.loc[50:300 , 'enaging_user_following_count'].head(250)
Why doesn't the "head" method take a long time? I feel like I'm missing something here because I'm able to pull a huge number of rows in a way shorter time than when using the "compute" method.
Or is the compute method used in other situations?
Note: I tried to read the documentation, but I couldn't find an explanation of why head() is fast.
I played around with this a bit half a year ago. .head() does not check all your partitions; it simply checks the first partition. There is no synchronization overhead etc., so it is quite fast, but it does not take the whole dataset into account.
You could try
df.loc[-251: , 'enaging_user_following_count'].head(250)
IIRC you should get the last 250 entries of the first partition instead of the actual last indices.
Also if you try something like
df.loc[conditionThatIsOnlyFulfilledOnPartition3 , 'enaging_user_following_count'].head(250)
you get an error that head could not find 250 samples.
If you actually just want the first few entries, however, it is quite fast :)
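If you do want head() to look at every partition rather than only the first one, you can pass npartitions (this scans more data, so expect it to be slower):

# check all partitions instead of only the first one
df.loc[50:300, 'enaging_user_following_count'].head(250, npartitions=-1)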
This processes the whole dataset
df.loc[1:5, 'enaging_user_following_count'].compute()
The reason is that loc is a label-based selector, and there is no telling which labels exist in which partition (there is no reason that they should be monotonically increasing). If the index is well-formed, you may have useful values in df.divisions, and in that case Dask should be able to pick only the partitions of your data that it needs.
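A sketch of what that can look like in practice (the file pattern and the timestamp column are made up for the example, and it assumes the data is already sorted by that column):

import dask.dataframe as dd

df = dd.read_csv("huge_file_*.csv")

# with a default integer index the divisions are unknown, so label-based
# selections may have to scan every partition
print(df.divisions)            # (None, None, ..., None)

# setting a sorted index gives Dask known divisions, so .loc can be pruned
# to just the partitions that contain the requested labels
df = df.set_index("timestamp", sorted=True)
print(df.divisions)            # actual boundary values per partition

subset = df.loc["2020-01-01":"2020-01-07", "enaging_user_following_count"]
result = subset.compute()      # only touches the partitions it needs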
I have many files with three million lines in an identical tab-delimited format. All I need to do is divide the number in the 14th "column" by the number in the 12th "column", then set the number in the 14th column to the result.
Although this is a very simple operation, I'm really struggling to work out how to achieve it. I've spent a good few hours searching this website, but unfortunately the answers I've seen have gone completely over my head, as I'm a novice coder!
The tools I have are Notepad++ and UltraEdit (which can run JavaScript, although I'm not familiar with it), and Python 3.6 (I have very basic Python knowledge). Other answers have suggested something called "awk", but when I looked it up it appears to need Unix, and I only have Windows. What's the best tool for getting this done? I'm more than willing to learn something new.
In Python there are a few ways to handle CSV-style data. For your particular use case, I think pandas is what you are looking for.
You can load your file with df = pandas.read_csv(); then performing the division and replacement is as easy as df[13] /= df[11].
Finally you can write your data back in csv format with df.to_csv().
I leave it to you to fill in the missing details of the pandas functions, but I promise it is very easy and you'll probably benefit from learning it for a long time.
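If you get stuck on those details, a minimal sketch for a single file might look like this (assuming the files have no header row, so pandas labels the columns 0, 1, 2, ... and the 14th and 12th columns get the labels 13 and 11):

import pandas as pd

# tab-delimited, no header row -> columns get integer labels 0 .. n-1
df = pd.read_csv("input.txt", sep="\t", header=None)

# divide the 14th column (label 13) by the 12th column (label 11)
# and store the result back in the 14th column
df[13] = df[13] / df[11]

# write the result back out in the same tab-delimited format
df.to_csv("output.txt", sep="\t", header=False, index=False)

Wrap that in a loop over your file names and you can process all of them in one go.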
Hope this helps
First of all, I know next to nothing about databases, so if the answer to my questions is "read a book on DBs", don't hesitate to tell me.
I have a large collection1 of environmental time series data collected at a number of different sites around the world. All time series have different lengths (e.g. one site may have data for a year, another one for two years, etc.), but they are generally in the same format (same column headers; columns of variables that were not measured at a specific site are filled with N/A). In addition, meta data including site description, instruments used, etc. are available for every dataset.
What I would like to do is store these measurements in a database that I can easily access using Python. I would like to analyse them using Pandas, so it would be great if there were a way to make this work with data frames instead of arrays for each single column. It probably won't be much of a problem to store each column as an individual array and construct data frames afterwards, though, if that makes more sense (for instance, to drop the N/A columns) and/or is easier to implement. Also, access speed has priority over file size.
It would be best to have a database that can work with queries like "give me the temperature time series from all grassland sites", "plot wind speed against time of day for all European measurements", and similar requests.
Of course I am not asking you for a complete solution, but I would be very grateful for some pointers in the right direction. What type of DB am I looking for? Is there something Python can work with? I was looking into PyTables, but I'm not exactly sure if it is a hierarchical database suited to my tasks (or if anything else is, for that matter). Thanks in advance.
1To be exact, I don't have it yet, but that's what I will work with in the near future. It's probably not what some of you would call "large collection". The whole DB needs to hold less than 1000 tables each with less than 100 columns and less than 100k rows.
I'd suggest using HDF5 for this. It's a disk file format which supports hierarchies, arrays, metadata like comments, and more. And it integrates very well with Python/NumPy via h5py and with Pandas via PyTables. See here: http://pandas.pydata.org/pandas-docs/stable/io.html#io-tools-text-csv-hdf5
Now, you may be saying "That's not a database!" Of course it isn't. But the example queries you gave, and my own experience with time series data, suggest that you don't need a traditional database system, because a lot of what you'll do with the data will occur on the client side, and the amount of data you want to store is possible to load into memory on commodity machines.
HDF5 supports compression (you may not want this if you only care about access speed). It's easy to read from multiple languages, including C++, Python, R, and more. It's also quite mature and battle-tested.
I'd consider storing each site's data in one file; this may make basic management tasks easier. But HDF5 has an internal hierarchy as well if you prefer to have it all in a single file. Depending on your access patterns you might make a different decision too, such as storing everything in a single file per month or so. Once you try it out for a while you'll probably come to a good understanding of what layout makes the most sense. There is also a tradeoff to be made with "chunking" if you will later add rows or columns (one or the other will be optimally efficient depending on how you store the data).
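As a rough sketch of what the pandas/HDF5 route mentioned above looks like (the column names site_type and temperature are made up for the example):

import pandas as pd

# a tiny stand-in for one site's time series
df = pd.DataFrame({
    "timestamp": pd.date_range("2015-01-01", periods=4, freq="H"),
    "site_type": ["grassland"] * 4,
    "temperature": [3.1, 3.4, 2.9, 3.0],
})

# store it as a queryable "table"-format dataset; data_columns makes those
# columns usable in on-disk queries
df.to_hdf("measurements.h5", key="site_001", format="table",
          data_columns=["site_type", "temperature"])

# later: pull only the rows you care about without loading everything
grassland = pd.read_hdf("measurements.h5", key="site_001",
                        where="site_type == 'grassland'")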
I want to create a table with Python that looks like a simple Excel table, so I have already used pyExcelerator. But now I'm thinking about just using pyplot.table, which seems very easy. However, I need to make some changes and I don't know if this is possible in pyplot.table.
For example, I want to add a cell in the upper left corner, and I also want to make two cells beneath the cell t+1 (see the table example below).
So, is it possible to do these changes in pyplot.table or should I better use another way to make tables?
Building a program to generate a table in an image for inclusion in your Word document is a bit of overkill. It's a lot of added work and completely unnecessary effort. Make the table in Excel and then paste it into Word; it'll look good and will be easier to update and change.
If you are using this as an excuse to learn something new, that is all well and good, but you need to give us more to help you with. SO isn't a code factory. Offer up what you have tried, and samples of code you are having trouble with. We can help with that.