I am trying to import a .csv file into Google Colab but I am running into this error.
I am using the HuggingFace nlp library because I need certain functionality from it, which is why I am importing the file through that library.
Here is the data I am trying to import. It is about 80 MB, labelled original_data.
The error seems to be caused by some stray space, whitespace, or tab character which I can't find.
So I used =TRIM(CLEAN()) in Excel to remove the spaces between sentences and after the final sentence, since they are unnecessary. I uploaded that version to the drive as well for reference, named data.csv.
I then converted data.csv (trimmed and cleaned) into a pandas DataFrame and tried converting it to a dataset that way, but ended up with this.
After adding on_bad_lines = 'skip', I ended up with 15k fewer rows. Here.
I think I am overthinking it and there is an easy way to fix this in Excel. Please let me know, and thank you in advance.
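For reference, this is roughly the pandas route I am describing, written as a minimal sketch (the file name is my cleaned upload, and the datasets import assumes the newer name of the nlp library):

import pandas as pd
from datasets import Dataset  # newer name of the HuggingFace nlp library

# Read the cleaned CSV; skipping bad lines is what drops the 15k rows mentioned above.
df = pd.read_csv("data.csv", on_bad_lines="skip", skipinitialspace=True)

# Strip stray whitespace/tabs from every text column.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip()

# Convert the pandas DataFrame into a HuggingFace dataset.
dataset = Dataset.from_pandas(df)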
I was looking for a way to write the resulting data out to a .txt or .csv file, but I could not find a solution simple enough for my level of understanding of this process.
I need to sort the words by frequency and pull out the top 100. I managed to do it (not in the best way, I know; I did everything in Google Colaboratory):
from collections import Counter

# Count word frequencies across the 'body' column and keep the 100 most common.
DATA = Counter(" ".join(test_data['body']).split()).most_common(100)
DATA
Question:
How do I save these top 100 words to a text file, .csv or .txt, and possibly an Excel version as well (or all three at once)?
I'm just learning and don't know a lot; I'm trying to figure it out and understand.
Here is a link to the Colab. For me the problem is that the words are Russian, and all the tutorials are for English texts, which is easier than working with Russian text.
https://colab.research.google.com/drive/1LZ3RHPTjTib8lUjzKGcCJgzYnODSjewL?usp=sharing
As I can see in the Colab file, you are using pandas, so the best way would be to use the pandas to_csv function to write to .csv, and to .txt by changing the sep argument.
You can write to Excel using to_excel.
Since you only want the top 100 of a particular column, you can first extract that column into a separate DataFrame by indexing on the column, sorting it (if it is not already sorted), and taking the first 100 rows with head(100).
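For example, a minimal sketch assuming DATA is the list of (word, count) pairs returned by most_common(100):

import pandas as pd

# DATA is the list of (word, count) tuples from Counter.most_common(100).
top_words = pd.DataFrame(DATA, columns=["word", "count"])

# Write the same table in three formats.
top_words.to_csv("top_words.csv", index=False, encoding="utf-8-sig")  # utf-8-sig helps Excel display Cyrillic
top_words.to_csv("top_words.txt", sep="\t", index=False)              # tab-separated text file
top_words.to_excel("top_words.xlsx", index=False)                     # needs openpyxl installed

In Colab you can then download the three files from the file browser on the left.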
I have a SAS file that is roughly 112 million rows. I do not actually have access to SAS software, so I need to get this data into, preferably, a pandas DataFrame or something very similar in the Python family. I just don't know how to do this efficiently; for example, df = pd.read_sas('filename.sas7bdat') takes a few hours. I can use chunk sizes, but that doesn't really solve the underlying problem. Is there any faster way to get this into pandas, or do I just have to eat the multi-hour wait? Additionally, even once I have read in the file, I can barely do anything with it, because iterating over the df takes forever as well and usually just ends up crashing the Jupyter kernel. Thanks in advance for any advice in this regard!
Regarding the first part, I'm afraid there is not much to do, as the read_sas options are limited.
For the second part: 1. iterating manually through rows is slow and not the pandas philosophy; whenever possible, use vectorized operations. 2. Look into specialized solutions for large datasets, like Dask, and read the pandas guide on scaling to large DataFrames.
Maybe you don't need your entire file to work on it, so you could take a 10% sample. You can also change your column dtypes to reduce memory usage.
If you want to store a DataFrame and reuse it, instead of re-importing the entire file each time you want to work on it, you can save it as a pickle file (.pkl) and reopen it with pandas.read_pickle. A sketch of that workflow follows below.
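A rough sketch of the chunked read plus pickle approach (the file name, chunk size, and downcasting choices are assumptions to adapt to your data):

import pandas as pd

# Read the SAS file in chunks so the whole table never has to be parsed in one go.
chunks = []
for chunk in pd.read_sas("filename.sas7bdat", chunksize=1_000_000):
    # Downcast float columns to shrink memory before keeping the chunk.
    for col in chunk.select_dtypes(include="float").columns:
        chunk[col] = pd.to_numeric(chunk[col], downcast="float")
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)

# Save once as a pickle so later sessions can reload in seconds instead of hours.
df.to_pickle("data.pkl")
df = pd.read_pickle("data.pkl")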
I'm currently trying to learn more about deep learning/CNNs/Keras through what I thought would be a fairly simple project: training a CNN to detect a single specific sound. It has been a lot more of a headache than I expected.
I'm currently reading through this (ignoring the second section about GPU usage); the first part definitely seems like exactly what I need. But when I go to run the script (mine is pretty much lifted entirely from the section in the link above that says "Putting the pieces together, you may end up with something like this:"), it gives me this error:
AttributeError: 'DataFrame' object has no attribute 'file_path'
I can't find anything in the pandas documentation about a DataFrame.file_path attribute, so I'm confused about what that part of the code is attempting to do.
My CSV file contains two columns: one with the file paths, and a second column labelling each file path as either positive or negative.
Sidenote: I'm also aware that this entire guide may simply not be what I'm looking for. I'm having a very hard time finding material that is useful for the specific project I'm trying to do, so if anyone has links that would be better, I'd be very appreciative.
The expression df.file_path means you are accessing the file_path column of your DataFrame. It seems that your DataFrame does not contain a column with that name. With df.head() you can check whether your DataFrame contains the needed fields.
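A minimal sketch of that check (the CSV name and column names here are placeholders for whatever your file actually uses):

import pandas as pd

df = pd.read_csv("labels.csv")
print(df.head())     # inspect the first rows
print(df.columns)    # see the actual column names

# If the columns are named differently, rename them to what the guide's code expects.
df = df.rename(columns={"path": "file_path", "label": "class"})
print(df.file_path.head())   # attribute-style access now works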
I have some pretty strange data I'm working with, as can be seen in the image. I can't seem to find any source data for the numbers these graphs are presenting.
Furthermore, if I search for the source, it only points to an empty cell for each graph.
Ideally I want to be able to retrieve the highlighted labels in each case using Python, and it seems finding the source is the only way to do this, so if you know of a Python module that can do that, I'd be happy to use it. Otherwise, if you can help me find the source data, that would be even better :P
So far I've tried the xlrd module for Python as well as manually showing all hidden cells, but neither works.
Here's a link to the file: Here
EDIT: I ended up just converting the .xlsx to a PDF using the cloudconvert.com API,
then using pdftotext to convert the data to a .txt, which captures everything, including the numbers on the edge of each chart, and that text can then be searched with an algorithm.
If a hopeless internet wanderer comes upon this thread with the same problem, you can PM me for more details :P
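The pdftotext step looks roughly like this (a sketch; the file names are placeholders, and it assumes pdftotext from poppler-utils is installed):

import subprocess

# Convert the PDF exported from the workbook to plain text; -layout keeps the
# rough positioning, so chart labels and the numbers on the chart edges stay together.
subprocess.run(["pdftotext", "-layout", "charts.pdf", "charts.txt"], check=True)

with open("charts.txt", encoding="utf-8") as f:
    text = f.read()

# The labels can then be found with ordinary string matching or regular expressions.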
I've been learning the ins and outs of pandas by manipulating large .csv files obtained online; the files are time series of financial data. So far I have figured out how to use HDFStore to store and manipulate them, but I was wondering whether there is an easier way to update the files without re-downloading the entire source file.
I ask because I'm working with twelve ~300+ MB files which update every 15 minutes. While I don't need the updates to be continuous, it would be swell not to download what I already have.
The Blaze library from Continuum should help you. You can find an introduction here.
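If you stick with HDFStore, one option is to append only the rows newer than what you already have stored. A sketch, where the store name, key, and timestamp column are assumptions:

import pandas as pd

store = pd.HDFStore("prices.h5")

# Last timestamp already stored under the 'quotes' key (requires a table-format store).
last = store.select_column("quotes", "index").max()

# Read the freshly downloaded file and keep only the rows after that timestamp.
new = pd.read_csv("latest_download.csv", parse_dates=["timestamp"], index_col="timestamp")
new = new[new.index > last]

# Append just the new rows instead of rewriting everything.
store.append("quotes", new)
store.close()

Whether you can avoid re-downloading the source at all depends on whether the data provider lets you request only the new rows.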