I'm currently trying to learn more about deep learning/CNNs/Keras through what I thought would be a fairly simple project: training a CNN to detect a single specific sound. It's been a lot more of a headache than I expected.
I'm currently reading through this (ignoring the second section about GPU usage); the first part definitely seems to be exactly what I need. But when I go to run the script (my script is pretty much lifted entirely from the section in the link above that says "Putting the pieces together, you may end up with something like this:"), it gives me this error:
AttributeError: 'DataFrame' object has no attribute 'file_path'
I can't find anything in the pandas documentation about a DataFrame.file_path attribute, so I'm confused about what that part of the code is attempting to do.
My CSV file contains two columns: one with the file paths, and a second labelling each path as either positive or negative.
Sidenote: I'm also aware that this guide may simply not be what I'm looking for. I'm having a very hard time finding any material that is useful for the specific project I'm trying to do, so if anyone has links that would be better I'd be very appreciative.
The statement df.file_path denotes that you want to access the file_path column in your dataframe. It seems that your dataframe does not contain this column. With df.head() you can check which columns your dataframe actually contains.
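A minimal sketch of that check, assuming the CSV is called labels.csv and the two columns are meant to be named file_path and label (all three names are assumptions; use whatever your file actually contains):

import pandas as pd

df = pd.read_csv('labels.csv')  # 'labels.csv' is a placeholder name
print(df.head())     # inspect the first few rows
print(df.columns)    # df.file_path only works if 'file_path' appears here

# If the CSV has no header row, supply the column names yourself:
df = pd.read_csv('labels.csv', header=None, names=['file_path', 'label'])
print(df.file_path.head())  # attribute-style access now works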
I'm trying to tokenize a gensim dataset, which I've never worked with before, and I'm not sure if it's a small bug or if I'm not doing it properly.
I loaded the dataset using
model = api.load('word2vec-google-news-300')
and from my understanding, to tokenize using nltk all I need to do is call
tokens = word_tokenize(model)
However, the error I'm getting is "TypeError: expected string or bytes-like object". What am I doing wrong?
word2vec-google-news-300 isn't a dataset that's appropriate to 'tokenize'; it's the pretrained GoogleNews word2vec model released by Google circa 2013 with 3 million word-vectors. It's got lots of word-tokens, each with a 300-dimensional vector, but no multiword texts needing tokenization.
You can run type(model) on the object that api.load() returns to see its Python type, which will offer more clues as to what's appropriate to do with it.
Also, something like nltk's word_tokenize() takes a single string; you'd typically not pass it a full large dataset in one call in any case. (You'd be more likely to iterate over many individual texts as strings, tokenizing each in turn.)
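A small sketch of both points (note that loading this model downloads roughly 1.6 GB, and nltk needs its 'punkt' tokenizer data downloaded once):

import gensim.downloader as api
from nltk.tokenize import word_tokenize
# import nltk; nltk.download('punkt')  # run once if the tokenizer data is missing

model = api.load('word2vec-google-news-300')
print(type(model))  # a gensim KeyedVectors object, not a text corpus

# word_tokenize() operates on one string at a time:
texts = ["The first document.", "And a second one."]  # placeholder texts
tokenized = [word_tokenize(text) for text in texts]
print(tokenized)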
Rewind a bit & think more about what kind of dataset you're looking for.
Try to get it in a simple format you can inspect yourself, as files, before doing extra steps. (Gensim's api.load() is really bad/underdocumented for that, returning who-knows-what depending on what you've requested.)
Try building on well-explained examples that already work, making minimal individual changes that you understand individually, checking continued proper operation after each step.
(Also, for future SO questions that may be any more complicated than this one: it's usually best to include the full error message you've received, including all lines of 'traceback' context showing the files and lines of code involved, so that answers can point directly at the relevant lines in your code, or in the libraries you're using.)
Both "krogh" and "barycentric" seem to not clean the dataframe fully (meaning between the first non-NaN and the last non-NaN).
What are they intended to be used for? (My use case would be a time series.)
Context: I'm setting up a pipeline with different cleaning functions to test later, and adapted the pandas.DataFrame.interpolate() method because it comes in pretty handy.
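A minimal reproduction sketch, assuming SciPy is installed (pandas delegates both methods to it); the series below stands in for the real data:

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 9.0])

print(s.interpolate(method='linear'))       # local, piecewise fill between neighbours
print(s.interpolate(method='krogh'))        # one polynomial through all known points
print(s.interpolate(method='barycentric'))  # same idea, barycentric Lagrange form

Because 'krogh' and 'barycentric' fit a single polynomial through every known point, they can oscillate or overflow on longer series, which may be related to the incomplete fills you're seeing.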
I am an elementary Python programmer and have been using a module called "pybaseball" to analyze sabermetrics data. While using this module, I came across a problem when trying to retrieve information from it. The module reads stats from a baseball stats site and presents them for ease of use, but some of the information is not shown and is instead replaced with a "...". An example of this is shown:
from pybaseball import batting_stats_range
data = batting_stats_range('2017-05-01', '2017-05-08')
print(data.head())
I should be getting:
https://github.com/jldbc/pybaseball#batting-stats-hitting-stats-for-players-within-seasons-or-during-a-specified-time-period
But the information from 'TM' all the way to 'CS' is cut off and replaced with a "..." in my output. Can someone explain to me why this happens and how I can prevent it?
As the docs state, head() is meant for "quickly testing if your object has the right type of data in it." So it is expected that some data may not show, because it is collapsed.
If you need to analyze the data in more detail, you can access specific columns with other methods.
For example, using iloc[]. You can read more about it here, but essentially you can "ask" for a slice of those columns and then apply a further slice to get only the first n rows.
Another example would be loc[] (docs here). The main difference is that loc[] uses labels (column names) to filter data instead of the numerical order of columns. You can filter a subset of specific columns and then take a sample of rows from that.
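A short sketch of both approaches; the column positions and the 'Tm'/'CS' labels are assumptions (taken from the pybaseball README), so check data.columns first:

from pybaseball import batting_stats_range

data = batting_stats_range('2017-05-01', '2017-05-08')
print(list(data.columns))   # the full set of column names, nothing collapsed

# Positional slicing with iloc: first 5 rows, a guessed range of column positions
print(data.iloc[:5, 5:15])

# Label-based slicing with loc: 'Tm' and 'CS' are assumed labels
print(data.loc[:, 'Tm':'CS'].head())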
So, to answer your question: "..." is pandas' way of collapsing data in order to give a prettier view of the results.
I have many files of three million lines each, all in an identical tab-delimited format. All I need to do is divide the number in the 14th "column" by the number in the 12th "column", then set the number in the 14th column to the result.
Although this is a very simple operation, I'm really struggling to work out how to achieve it. I've spent a good few hours searching this website, but unfortunately the answers I've seen have gone completely over my head, as I'm a novice coder!
The tools I have are Notepad++ and UltraEdit (which can run JavaScript, although I'm not familiar with it), plus Python 3.6 (I have very basic Python knowledge). Other answers have suggested something called "awk", but when I looked it up it needs Unix, and I only have Windows. What's the best tool for getting this done? I'm more than willing to learn something new.
In Python there are a few ways to handle CSV-style data. For your particular use case, I think pandas is what you are looking for.
You can load your file with df = pandas.read_csv() (pass sep='\t' since your files are tab-delimited); performing your division and replacement will then be as easy as df[13] /= df[11].
Finally you can write your data back in CSV format with df.to_csv().
I leave it to you to fill in the missing details of the pandas functions, but I promise it is very easy and you'll probably benefit from learning it for a long time.
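A minimal sketch of the whole round trip, assuming the files have no header row (so pandas labels the columns 0, 1, 2, ... and the 14th column is df[13]); the file names are placeholders:

import pandas as pd

df = pd.read_csv('input.txt', sep='\t', header=None)

df[13] /= df[11]  # divide the 14th column by the 12th, store the result in the 14th

df.to_csv('output.txt', sep='\t', header=False, index=False)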
Hope this helps
I have some pretty strange data I'm working with, as can be seen in the image. I can't seem to find any source data for the numbers these graphs are presenting.
Furthermore if I search for the source it only points to an empty cell for each graph.
Ideally I want to be able to retrieve the highlighted labels in each case using Python, and it seems finding the source is the only way to do this, so if you know of a Python module that can do that I'd be happy to use it. Otherwise, if you can help me find the source data, that would be even perfecter :P
So far I've tried the xlrd module for Python, as well as manually showing all hidden cells, but neither works.
Here's a link to the file: Here
EDIT: I ended up just converting the xlsx to a PDF using the cloudconvert.com API, then using pdftotext to convert it to a .txt. The text dump includes everything, even the numbers on the edges of the charts, which can then be searched with an algorithm.
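For the second step, a minimal sketch assuming the workbook has already been converted to chart.pdf and that pdftotext (from poppler-utils) is on your PATH; the label being searched for is a placeholder:

import subprocess

# -layout keeps the extracted text roughly in its visual positions
subprocess.run(['pdftotext', '-layout', 'chart.pdf', 'chart.txt'], check=True)

with open('chart.txt', encoding='utf-8') as f:
    text = f.read()

print('found' if 'some label' in text else 'not found')  # 'some label' is a placeholder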
If a hopeless internet wanderer comes upon this thread with the same problem, you can PM me for more details :P