I'm trying to pickle a pandas DataFrame to my local directory so I can work on it in another Jupyter notebook. The write appears to succeed at first, but when I try to read the file in a new Jupyter notebook the read fails.
When I open the pickle file I appear to have written, the file's only contents are:
Error! /Users/.../income.pickle is not UTF-8 encoded
Saving disabled.
See console for more details.
I also checked and the pickle file itself is only a few kilobytes.
Here's my code for writing the pickle:
with open('income.pickle', 'wb', encoding='UTF-8') as to_write:
    pickle.dump(new_income_df, to_write)
And here's my code for reading it:
with open('income.pickle', 'rb') as read_file:
    income_df = pickle.load(read_file)
Also when I return income_df I get this output:
Series([], dtype: float64)
It's an empty Series that errors when I try to call most Series methods on it.
If anyone knows a fix for this I'm all ears. Thanks in advance!
EDIT:
This is the solution I arrived at:
with open('cleaned_df', 'wb') as to_write:
    pickle.dump(df, to_write)

with open('cleaned_df', 'rb') as read_file:
    df = pickle.load(read_file)
Which was much simpler than I expected
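For the record, one subtle issue with the original write is that Python 3 rejects an encoding argument in binary mode, so the combination of 'wb' and encoding='UTF-8' can never work (a minimal sketch of the failure):

```python
import pickle

# Python 3 refuses an encoding argument in binary mode, which is why
# the original 'wb' + encoding='UTF-8' call could not write a valid pickle.
try:
    with open('income.pickle', 'wb', encoding='UTF-8') as to_write:
        pickle.dump([], to_write)
except ValueError as exc:
    print(exc)  # binary mode doesn't take an encoding argument
```

Dropping the encoding argument, as in the solution above, is exactly the fix.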
A note on the original code first: pickling a pandas DataFrame works fine on its own, but open('income.pickle', 'wb', encoding='UTF-8') mixes binary mode with a text encoding, which Python rejects, so the DataFrame was never written out correctly. Besides pickling, there are a few other ways to move data between notebooks:
1) You can write only the data from the DataFrame to a csv file.
# Write/read csv file using DataFrame object's "to_csv" method.
import pandas as pd
new_income_df.to_csv("mydata.csv")
new_income_df2 = pd.read_csv("mydata.csv")
2) If your data can be produced by a function in a regular Python module (a *.py file), you can call that function from a Jupyter notebook. You can also reload the function after you have changed the values inside; see the autoreload extension documentation: https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
# Saved as "mymodule1.py" (from notebook1.ipynb).
import pandas as pd
def funcdata():
    new_income_df = pd.DataFrame(data=[100, 101])
    return new_income_df
# notebook2.ipynb
%load_ext autoreload
%autoreload 2
import pandas as pd
import mymodule1
df2 = mymodule1.funcdata()
print(df2)
# Change data inside funcdata() in mymodule1.py and see if it changes here.
3) You can share data between Jupyter notebooks using the %store magic command.
Source: https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
And: Share data between IPython Notebooks
# %store example, first Jupyter notebook.
from sklearn import datasets
dataset = datasets.load_iris()
%store dataset
# from a new Jupyter notebook read.
%store -r dataset
(Very new coder, first time here, apologies if there are errors in writing)
I have a csv file I made from Excel called SouthKoreaRoads.csv and I'm supposed to read that csv file using Pandas. Below is what I used:
import pandas as pd
import os
SouthKoreaRoads = pd.read_csv("SouthKoreaRoads.csv")
I get a FileNotFoundError, and I'm really new and unsure how to approach this. Could anyone help, give advice, or anything? Many thanks in advance
Just some explanation aside: before you can use pd.read_csv to import your data, you need to locate the file in your filesystem.
Assuming you use a Jupyter notebook or a Python file, and the csv file is in the same directory you are currently working in, you can just use:
import pandas as pd

SouthKoreaRoads_df = pd.read_csv('SouthKoreaRoads.csv')
If the file is located in another directory, you need to specify that directory. For example, if the csv is in a subdirectory (relative to the Python file / notebook you are working in), you need to add the directory's name. If it's in a folder "data", add "data" in front of the file name, separated with a "/":
import pandas as pd

SouthKoreaRoads_df = pd.read_csv('data/SouthKoreaRoads.csv')
Pandas accepts every valid string path and URL, so you could also give a full path:
import pandas as pd

# Use a raw string (or forward slashes) so the backslashes are not treated as escape sequences.
SouthKoreaRoads_df = pd.read_csv(r'C:\Users\Ron\Desktop\Clients.csv')
So far no os package is needed. pandas.read_csv can also take os.PathLike objects, but the os module is only needed if you want to build a path in a variable before accessing it, or if you do more complex path handling, for example because the code has to run in another environment, like a web app, where the path is relative and could change when deployed differently.
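As a sketch of that last point (the folder name "data" is just carried over from the example above), you can build the path with the standard pathlib module and hand it straight to read_csv:

```python
from pathlib import Path

import pandas as pd

# Build the path relative to the current working directory;
# read_csv accepts Path objects as well as plain strings.
csv_path = Path('data') / 'SouthKoreaRoads.csv'

if csv_path.exists():
    SouthKoreaRoads_df = pd.read_csv(csv_path)
else:
    # Printing the absolute path often explains a FileNotFoundError.
    print('Not found:', csv_path.resolve())
```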
please see also:
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
https://docs.python.org/3/library/os.path.html
BR
SouthKoreaRoads = pd.read_csv("./SouthKoreaRoads.csv")
Try this and see whether it could help!
Try to put the full path, like "C:/users/....".
I just want to import this csv file. Pandas can read it, but somehow it doesn't create separate columns. Does anyone know why?
This is my code:
import pandas as pd
songs_data = pd.read_csv('../datasets/spotify-top50.csv', encoding='latin-1')
songs_data.head(n=10)
Result that I see in Jupyter:
P.S.: I'm kinda new to Jupyter and programming, but after all I found it should work properly. I don't know why it doesn't do it.
To properly load a csv file you should specify some parameters. For example, in your case you need to specify quotechar:
df = pd.read_csv('../datasets/spotify-top50.csv',quotechar='"',sep=',', encoding='latin-1')
df.head(10)
If you still have a problem, have another look at your CSV file and at the pandas documentation, so that you can set the parameters to match your CSV file's structure.
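If you are not sure which separator or quote character a file actually uses, the standard library's csv.Sniffer can guess the dialect from a small sample (a sketch with an inline sample standing in for the real file):

```python
import csv

# A small sample standing in for the first lines of the real csv file.
sample = 'Track.Name,Artist.Name,Genre\n"Senorita","Shawn Mendes","canadian pop"\n'

# Sniffer inspects the sample and guesses the dialect.
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)   # ,
print(dialect.quotechar)   # "
```

The detected delimiter and quotechar can then be passed on to pd.read_csv as sep and quotechar.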
I wrote a python script to get data from my Gmail account which I imported as a pandas dataframe into a Jupyter notebook. The notebook is called "Automation via Gmail API" and the dataframe is simply called "df". Now I want to use this df to update a Google Sheet via the Google Sheets API. To this end I created another notebook - "Automation via Sheets API". But how can I access df in the "Automation via Sheets API" notebook? Apparently, Jupyter provides some functionality to load a notebook into another notebook. I simply copy and pasted the code of the "Notebook Loader" into my Sheets-notebook and only changed "path" and "fullname", but it doesn't work and I don't have a clue why:
# Load df from the "Automation via Gmail API" notebook.
fullname = "Automation via Gmail API.ipynb"

class NotebookLoader(object):
    """Module Loader for Jupyter Notebooks"""

    def __init__(self, path="C:\\Users\\Moritz Wolff\\Desktop\\gmail automatisierung\\Gmail API"):
        self.shell = InteractiveShell.instance()
        self.path = path

    def load_module(self, fullname="Automation via Gmail API.ipynb"):
        """import a notebook as a module"""
        path = find_notebook(fullname, self.path)
        [...]
There is no error message. Is my strategy flawed from the start, or am I simply missing a little detail? Any help is appreciated.
A direct option is to save the dataframe as a text table in the original notebook and read it into the other. Instead of plain text you can also save the dataframe itself as serialized Python for a little more efficiency/convenience.
Options from source notebook:
df.to_csv('example.tsv', sep='\t') # add `, index = False` to leave off index
# -OR-
df.to_pickle("file_name.pkl")
Options in reading notebook:
import pandas as pd
df = pd.read_csv('example.tsv', sep='\t')
#-OR-
df = pd.read_pickle("file_name.pkl")
I used a tab-delimited tabular text structure, but you are welcome to use comma-separated values.
I would avoid loading your notebook from another notebook unless you are sure that is how you want to approach your problem.
You can always export your dataframe to a csv using pandas.DataFrame.to_csv()
, then load it in your other notebook with pandas.read_csv()
import pandas as pd

df = pd.DataFrame(['test', 'data'])
df.to_csv('data1.csv')
Then in your other notebook:
df = pd.read_csv('data1.csv', index_col = 0)
Alternatively you can try using the %store magic function:
df = pd.DataFrame(['test', 'data'])
%store df
Then to recall it in another notebook to retrieve it:
%store -r df
One constraint of this method is that you have to %store your data again each time the variable is updated.
Documentation: https://ipython.readthedocs.io/en/stable/config/extensions/storemagic.html
I'm trying to tackle the Kaggle Titanic challenge. Bear with me, as I'm fairly new to data science. I was previously struggling to get the following syntax to work: my previous question (Reading CSV files in Python 3.6, using IntelliJ IDEA)
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

titanic_df = pd.read_csv('train.csv')
titanic_df.head()
However, using the below code, I am able to open the file and read/print its contents, but I need to convert the data to a dataframe so that it can be worked with. Any suggestions?
file_path = '/Volumes/LACIE SETUP/Data_Science/Data_Analysis_Viz_InPython/Example_Projects/train.csv'
with open(file_path) as train_fp:
    for line in train_fp:
        print(line)
The above code was able to print out the data, but when I tried passing file_path to:
titanic_df = pd.read_csv('file_path.csv')
I received the same error as before. Not sure what I'm doing wrong. I KNOW the file 'train.csv' exists in that location because 1) I put it there and 2) its contents can be printed when pointed to its location.
So what the heck am I doing wrong??? :/
read_csv will create a pandas DataFrame. So, as long as your file path is right, the following code should work. Also, make sure to use the file_path variable and not the string "file_path.csv":
import pandas as pd
file_path = '/Volumes/LACIE SETUP/Data_Science/Data_Analysis_Viz_InPython/Example_Projects/train.csv'
titanic_df = pd.read_csv(file_path)
titanic_df.head()
I'm having some trouble reading a csv file
import pandas as pd
df = pd.read_csv('Data_Matches_tekha.csv', skiprows=2)
I get
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 526, saw 5
and when I add sep=None to read_csv I get another error:
Error: line contains NULL byte
I tried adding unicode='utf-8', and I even tried the csv reader module, but nothing works with this file.
The csv file itself looks totally fine; I checked it and I see nothing wrong with it.
In your actual code, the line is:
>>> pandas.read_csv("Data_Matches_tekha.xlsx", sep=None)
You are trying to read an Excel file, and not a plain text CSV which is why things are not working.
Excel files (xlsx) are in a special binary format which cannot be read as simple text files (like CSV files).
You can either convert the Excel file to a CSV file (note: if you have multiple sheets, each sheet should be converted to its own csv file) and then read those, or read the Excel file directly: use read_excel, or a library like xlrd which is designed to read the binary format of Excel files; see Reading/parsing Excel (xls) files with Python for more information on that.
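A sketch of that sheet-by-sheet conversion (a tiny workbook is generated first so the example is self-contained; it assumes an Excel engine such as openpyxl is installed):

```python
import pandas as pd

# Build a tiny two-sheet workbook so the sketch is self-contained
# (in practice you would start from the existing .xlsx file).
with pd.ExcelWriter('example.xlsx') as writer:
    pd.DataFrame({'a': [1, 2]}).to_excel(writer, sheet_name='Sheet1', index=False)
    pd.DataFrame({'b': [3, 4]}).to_excel(writer, sheet_name='Sheet2', index=False)

# sheet_name=None reads every sheet into a dict: {sheet name: DataFrame}.
sheets = pd.read_excel('example.xlsx', sheet_name=None)

# Write each sheet to its own csv file, as suggested above.
for name, frame in sheets.items():
    frame.to_csv(f'{name}.csv', index=False)
```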
Use read_excel instead of read_csv for an Excel file:
import pandas as pd
df = pd.read_excel("Data_Matches_tekha.xlsx")
I encountered the same error when I used to_csv to write some data and then read it in another script. I found an easy solution that doesn't go through pandas' read functions: the pickle module, which ships with the Python standard library, so nothing needs to be installed.
Then you can write your data (first) with the code below:
import pickle
with open(path, 'wb') as output:
    pickle.dump(variable_to_save, output)
And finally read your data in another script using:
import pickle
with open(path, 'rb') as input_file:
    data = pickle.load(input_file)
Note that if you want to read your saved data with a different Python version than the one you saved it with, you can specify that in the writing step by passing protocol=x, with x corresponding to the version (2 or 3) you aim to use for reading.
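A minimal sketch of that protocol option (the file name here is arbitrary):

```python
import pickle

data = {'values': [1, 2, 3]}

# protocol=2 produces a stream that Python 2 can still read;
# protocols 3 and up are Python 3 only.
with open('data_compat.pkl', 'wb') as output:
    pickle.dump(data, output, protocol=2)

with open('data_compat.pkl', 'rb') as read_file:
    restored = pickle.load(read_file)

print(restored == data)  # True
```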
I hope this can be of any use.