Divide one "column" by another in Tab Delimited file - python

I have many files with three million lines in identical tab delimited format. All I need to do is divide the number in the 14th "column" by the number in the 12th "column", then set the number in the 14th column to the result.
Although this is a very simple function I'm actually really struggling to work out how to achieve this. I've spent a good few hours searching this website but unfortunately the answers I've seen have completely gone over the top of my head as I'm a novice coder!
The tools I have Notepad++ and Ultraedit (which has the ability to use Javascript, although i'm not familiar with this), and Python 3.6 (I have very basic Python knowledge). Other answers have suggested using something called "awk", but when I looked this up it needs Unix - I only have Windows. What's the best tool for getting this done? I'm more than willing to learn something new.

In python there are a few ways to handle csv. For your particular use case
I think pandas is what you are looking for.
You can load your file with df = pandas.read_csv(), then performing your division and replacement will be as easy as df[13] /= df[11].
Finally you can write your data back in csv format with df.to_csv().
I leave it to you to fill in the missing details of the pandas functions, but I promise it is very easy and you'll probably benefit from learning it for a long time.
Hope this helps

Related

Lazily reading a parquet file with binary datatype in PyPolars

I hope this is a good question, if I should post this as an issue on the PyPolars GitHub instead, please let me know.
I have a quite large parquet file where some columns contain binary data.
These columns are not interesting for me right now, so it is ok for me that PyPolars does not support the Binary datatype so far (this is how I understand it at least, my question would be irrelevant if that were not the case!), but I would like to make full use of the query optimization by lazily reading the file with .scan_parquet() instead of read_parquet().
Currently .scan_parquet() gives me the following error:
pyo3_runtime.PanicException: Arrow datatype Binary not supported by Polars. You probably need to activate that data-type feature.
and I don't know of a way to 'activate that data-type feature'
So my workaround is to use .read_parquet() and specify in advance which columns I want to use so that it never attempts to read the Binary ones.
The problem is I am doing exploratory data analysis and there are a large amount of columns so for one it is annoying to have to specify a large list of columns (basically ~150 minus the two that produce the issue) and it is also inefficient to read all these columns each time when I only need some small subset each time (it is even more annoying to change a small list of columns each time I, for example, add some filter).
It would be ideal if I could use .scan_parquet and let the query optimizer figure out that it only needs to read the (unproblematic) columns that I actually need.
Is there a better way of doing things that I am not seeing?

Python: Take CSV tables and find the 4 highest values and their location

I have a folder of 310 .csv files. Here is a sample of what the contents look like
I need to create a program that goes through all the files, lists the file name, then lists the top 4 values from the table and the x-value associated with it. Ideally this would all be saved to a text doc but as long as it prints in a readable format that would be ideal.
So, what is stopping you? Loop through the files, use pandas.read_csv to import each csv file and merge/join them all into one DataFrame. Use slicing to select the 4 top rows, and you can always print/visualize anything directly in a Jupyter Notebook. Exporting can be done using df.to_csv or any other method and if you need a short introduction to pandas, look here.
Keep in mind that it is always a good idea, to include a Minimal, Reproducible Example. Especially for a complicated merge operation between many DataFrames, this could help you a lot. However, there is no way around some research.

Assistance with Keras for a noise detection script

I'm currently trying to learn more about Deep learning/CNN's/Keras through what I thought would be a quite simple project of just training a CNN to detect a single specific sound. It's been a lot more of a headache than I expected.
I'm currently reading through this ignoring the second section about gpu usage, the first part definitely seems like exactly what I'm needing. But when I go to run the script, (my script is pretty much totally lifted from the section in the link above that says "Putting the pieces together, you may end up with something like this:"), it gives me this error:
AttributeError: 'DataFrame' object has no attribute 'file_path'
I can't find anything in the pandas documentation about a DataFrame.file_path function. So I'm confused as to what that part of the code is attempting to do.
My CSV file contains two columns, one with the paths and then a second column denoting the file paths as either positive or negative.
Sidenote: I'm also aware that this entire guide just may not be the thing I'm looking for. I'm having a very hard time finding any material that is useful for the specific project I'm trying to do and if anyone has any links that would be better I'd be very appreciative.
The statement df.file_path denotes that you want access the file_path column in your dataframe table. It seams that you dataframe object does not contain this column. With df.head() you can check if you dataframe object contains the needed fields.

Python: Is pandas the right thing to use? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 9 years ago.
Improve this question
I'm trying to take a 1.8mB txt file. There are couple of header lines afterwards its all space separated data. I can pull the data off using pandas. What I'm wanting to do with the data is:
1) Cut out the non essential data. ie the first 1675 lines, roughly I want to remove and the last 3-10 lines, varies day to day, I also want to remove. I can remove the first lines, kind of. The big problem with this idea I'm having right now is knowing for sure where the 1675 pointer location is. Using something like
df = df[df.year > 1978]
only moves the initial 'pointer' to 1675. If I try
dataf = df[df.year > 1978]
it just gives me a pure copy of what I would have with the first line. It still keeps the pointer to the same 1675 start point. It won't allow me to access any of the first 1675 rows but they are still obviously there.
df.year[0]
It comes back with an error suggesting row 0 doesn't exist. I have to go out and search to find what the first readable row is...instead of flat out removing the rows and moving the new pointer up to 0 it just moves the pointer to 1675 and won't allow access to anything lower than that. I still haven't found a way to be able to determine what the last row number is through programming, through the shell it's easy but I need to be able to do it through the program so I can set up the loop for point 2.
2) I want to be able to take averages of the data, 'x' day moving averages and create a new column with the new data once I have calculated the moving average. I think I can create the new column with the Series statement...I think...I haven't tried it yet though as I haven't been able to get this far yet.
3) After all this and some more math I want to be able to graph the data with a homemade graph. I think this should be easy once I have everything else completed. I have already created the sample graph and can plot the points/lines on the graph once I have the data to work with.
Is panda the right lib for the project or should I be trying to use something else? So far the more research I do...the more lost I get as everything I keep trying gets me a little further but sets me even further back at the same time. In something similar I saw mention using something else when wanting to do math on the data block. Their wasn't any indication as to what he used though.
It sounds like you main trouble is indexing. If you want to refer to the "first" thing in a DataFrame, use df.iloc[0]. But DataFrame indexes are really powerful regardless.
http://pandas.pydata.org/pandas-docs/stable/indexing.html
I think you are headed in the right direction. Pandas gives you nice, high level control over your data so that you can manipulate it much more easily than using traditional logic. It will take some work to learn. Work through their tutorials and you should be fine. But don't gloss over them or you'll miss some important details.
I'm not sure why you are concerned that the lines you want to ignore aren't being deleted as long as they aren't used in your analysis, it doesn't really matter. Unless you are facing memory constraints, it's probably irrelevant. But, if you do find you can't afford to keep them around, I'm sure there is a way to really remove them, even if it's a bit sideways.
Processing a few megabytes worth of data is pretty easy these days and Pandas will handle it without any problems. I believe you can easily pass pandas data to numpy for your statistical calculations. You should double check that, though, before taking my word for it. Also, they mention matplotlib on the pandas website, so I am guessing it will be easy to do basic graphing as well.

fault-tolerant python based parser for WikiLeaks cables

Some time ago I started writing a BNF-based grammar for the cables which WikiLeaks released. However I now realized that my approach is maybe not the best and I'm looking for some improvement.
A cabe consists of three parts. The head has some RFC2822-style format. This parses usually correct. The text part has a more informal specification. For instance, there is a REF-line. This should start with REF:, but I found different versions. The following regex catches most cases: ^\s*[Rr][Ee][Ff][Ss: ]. So there are spaces in front, different cases and so on. The text part is mostly plain text with some special formatted headings.
We want to recognize each field (date, REF etc.) and put into a database. We chose Pythons SimpleParse. At the moment the parses stops at each field which it doesn't recognize. We are now looking for a more fault-tolerant solution. All fields have some kind of order. When the parser don't recognize a field, it should add some 'not recognized'-blob to the current field and go on. (Or maybe you have some better approach here).
What kind of parser or other kind of solution would you suggest? Is something better around?
Cablemap seems to do what you're searching for: http://pypi.python.org/pypi/cablemap.core/
I haven't looked at the cables but lets take a similar problem and consider the options: Lets say you wanted to write a parser for RFCs, there's an RFC for formatting of RFCs, but not all RFCs follow it.
If you wrote a strict parser, you'll run into the situation you've run into - the outliers will halt your progress - in this case you've got two options:
Split them into two groups, the ones that are strictly formatted and the ones that aren't. Write your strict parser so that it gets the low hanging fruit and figure out based on the number outliers what the best options are (hand processing, outlier parser, etc)
If the two groups are equally sized, or there are more outliers than standard formats - write a flexible parser. In this case regular expressions are going to be more beneficial to you as you can process an entire file looking for a series of flexible regexs, if one of the regexes fails you can easily generate the outlier list. But, since you can make the search against a series of regexes you could build a matrix of pass/fails for each regex.
For 'fuzzy' data where some follow the format and some do not, I much prefer using the regex approach. That's just me though. (Yes, it is slower, but having to engineer the relationship between each match segment so that you have a single query (or parser) that fits every corner case is a nightmare when dealing with human generated input.

Categories

Resources