Importing a Dataframe from one Jupyter Notebook into another Jupyter Notebook - python

I wrote a Python script to get data from my Gmail account, which I imported as a pandas dataframe into a Jupyter notebook. The notebook is called "Automation via Gmail API" and the dataframe is simply called "df". Now I want to use this df to update a Google Sheet via the Google Sheets API. To this end I created another notebook, "Automation via Sheets API". But how can I access df in the "Automation via Sheets API" notebook? Apparently, Jupyter provides some functionality to load one notebook into another. I simply copied and pasted the code of the "Notebook Loader" into my Sheets notebook and only changed "path" and "fullname", but it doesn't work and I don't have a clue why:
#Load df from the "Automation via Gmail API" notebook.
fullname = "Automation via Gmail API.ipynb"

class NotebookLoader(object):
    """Module Loader for Jupyter Notebooks"""
    def __init__(self, path="C:\\Users\\Moritz Wolff\\Desktop\\gmail automatisierung\\Gmail API"):
        self.shell = InteractiveShell.instance()
        self.path = path

    def load_module(self, fullname="Automation via Gmail API.ipynb"):
        """import a notebook as a module"""
        path = find_notebook(fullname, self.path)
        [...]
There is no error message. Is my strategy flawed from the start, or am I simply missing a little detail? Any help is appreciated.

A direct option is to save the dataframe as a text table in the original notebook and read it into the other one. Instead of plain text, you can also save the dataframe itself as serialized Python (a pickle) for a little more efficiency/convenience.
Options from source notebook:
df.to_csv('example.tsv', sep='\t') # add `, index = False` to leave off index
# -OR-
df.to_pickle("file_name.pkl")
Options in reading notebook:
import pandas as pd
df = pd.read_csv('example.tsv', sep='\t')
#-OR-
df = pd.read_pickle("file_name.pkl")
I used a tab-delimited structure here, but you are welcome to use comma-separated values instead.

I would avoid loading your notebook from another notebook unless you are sure that is how you want to approach your problem.
You can always export your dataframe to a csv using pandas.DataFrame.to_csv(), then load it in your other notebook with pandas.read_csv():
import pandas as pd

df = pd.DataFrame(['test', 'data'])  # a plain list has no to_csv(); wrap it in a DataFrame first
df.to_csv('data1.csv')
Then in your other notebook:
df = pd.read_csv('data1.csv', index_col = 0)
Alternatively you can try using the %store magic function:
df = ['test','data']
%store df
Then, to retrieve it in another notebook:
%store -r df
One constraint of this method is that you have to %store your data again each time the variable is updated; an example follows the documentation link below.
Documentation: https://ipython.readthedocs.io/en/stable/config/extensions/storemagic.html
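A minimal sketch of that constraint, using a throwaway dataframe (the variable name df is just illustrative):
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
%store df                # snapshot saved for other notebooks

df['a'] = df['a'] * 10   # later update to the variable
%store df                # store again, or other notebooks get the old snapshot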

Related

How do I get my JupyterLab to show the values of my Dataframe?

So my goal is to print the values of my dataframe, yet that does not seem to be working. I created a dictionary, then used that dictionary to create a dataframe. The console doesn't show anything when I use a print statement. The dictionary isn't even printing. What should I do?
import pandas as pd
import numpy as np
data = {"school" : ['amundsen', 'clemente', 'corliss','douglass','eric hs','enger','gage park','harlan','hirsch','hubbard','juarez','kelly'],
"DO2019" : [6,0.9,2.3,0.6,3,16.5,10.6,10.3,11.2,7.4,5.5,5.5],
"median_income" : [50065, 58987, 40394,28059,42809,40394,37367,40394,40176,37367,42575,42809]
}
print(data)
df = pd.DataFrame.from_dict(data)
print(df)
I was able to display the data frame from your sample code in a Jupyter notebook. Can you open a new Python notebook and see if you can display the sample output? If not, one way is to stop the notebook and restart it; sometimes this has happened to me and restarting helped.
First you need to run the jupyter notebook command, and it will open a tree of your files in your browser. Then you can create a new notebook, copy this code into a cell, and run it using the play button at the top or by pressing Shift+Enter.
You don't need print(df); just place df in the cell and run it:
df = pd.DataFrame.from_dict(data)
df
It will show your df, and if you want you can show only the head or tail of the df:
df.head(5)
df.tail(5)
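If you ever do need to render a dataframe from the middle of a cell (e.g. inside a loop), a small sketch using IPython's display function, which is what notebooks call under the hood:
from IPython.display import display
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# display() renders the dataframe even when it is not the last expression in the cell
display(df)
print('this line still prints normally')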

Exporting scraped content to google sheets

I want to scrape a website for some information. It would be 3 to 4 columns. The difficult part is, I want to export all the data into Google Sheets and make the crawler run at specific intervals. I'll be using Scrapy for this purpose. Any suggestions on how I can do this (by making a custom pipeline or any other way, as I don't have much experience in writing custom pipelines)?
You can use the Google API and the Python pygsheets module.
Refer to the pygsheets documentation for more details.
Please see the sample code below; this might help you.
import pygsheets
import pandas as pd
#authorization
gc = pygsheets.authorize(service_file='/Users/desktop/creds.json')
# Create empty dataframe
df = pd.DataFrame()
# Create a column
df['name'] = ['John', 'Steve', 'Sarah']
#open the google spreadsheet (where 'PY to Gsheet Test' is the name of my sheet)
sh = gc.open('PY to Gsheet Test')
#select the first sheet
wks = sh[0]
#update the first sheet with df, starting at cell A1 (pygsheets start positions are 1-based (row, col)).
wks.set_dataframe(df,(1,1))
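Since the question mentions a custom pipeline: below is a minimal sketch of a Scrapy item pipeline that collects items and pushes them to a sheet when the spider closes. The credentials path, sheet name, and the idea of buffering items in memory are assumptions for illustration, not from the question.
import pandas as pd
import pygsheets

class GoogleSheetsPipeline:
    """Collect scraped items and write them to a Google Sheet on spider close."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # Hypothetical credentials file and sheet name.
        gc = pygsheets.authorize(service_file='creds.json')
        wks = gc.open('Scraped Data')[0]
        wks.set_dataframe(pd.DataFrame(self.items), (1, 1))
Enable it via ITEM_PIPELINES in settings.py; for running at intervals, scheduling scrapy crawl with cron (or a similar scheduler) is the usual approach.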

Pickle encoding utf-8 issue

I'm trying to pickle a pandas dataframe to my local directory so I can work on it in another Jupyter notebook. The write appears to be successful at first, but when trying to read it in a new Jupyter notebook, the read is unsuccessful.
When I open the pickle file I appear to have written, the file's only contents are:
Error! /Users/.../income.pickle is not UTF-8 encoded
Saving disabled.
See console for more details.
I also checked and the pickle file itself is only a few kilobytes.
Here's my code for writing the pickle:
with open('income.pickle', 'wb', encoding='UTF-8') as to_write:
    pickle.dump(new_income_df, to_write)
And here's my code for reading it:
with open('income.pickle', 'rb') as read_file:
    income_df = pickle.load(read_file)
Also when I return income_df I get this output:
Series([], dtype: float64)
It's an empty Series that raises errors when I try to call most Series methods on it.
If anyone knows a fix for this I'm all ears. Thanks in advance!
EDIT:
This is the solution I arrived at:
with open('cleaned_df', 'wb') as to_write:
    pickle.dump(df, to_write)

with open('cleaned_df', 'rb') as read_file:
    df = pickle.load(read_file)
Which was much simpler than I expected. (Note the difference: the failing version passed encoding='UTF-8' to open() in binary mode, which Python does not accept.)
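Equivalently, pandas wraps the pickling itself, as used earlier in this thread (same dataframe, illustrative filename):
import pandas as pd

# Write in one notebook...
df.to_pickle('cleaned_df.pkl')

# ...and read in the other.
df = pd.read_pickle('cleaned_df.pkl')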
Pickling is generally used to store raw data, not to pass a Pandas DataFrame object. When you try to pickle it, it may just store the top-level module name, Series in this case.
1) You can write only the data from the DataFrame to a csv file.
# Write/read csv file using DataFrame object's "to_csv" method.
import pandas as pd
new_income_df.to_csv("mydata.csv")
new_income_df2 = pd.read_csv("mydata.csv")
2) If your data can be saved as a function in a regular Python module with a *.py name, you can call it from a Jupyter notebook and reload the function after you have changed the values inside. See the autoreload documentation: https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
# Saved as "mymodule1.py" (from notebook1.ipynb).
import pandas as pd

def funcdata():
    new_income_df = pd.DataFrame(data=[100, 101])
    return new_income_df
# notebook2.ipynb
%load_ext autoreload
%autoreload 2
import pandas as pd
import mymodule1  # import by module name, without the .py extension
df2 = mymodule1.funcdata()
print(df2)
# Change the data inside funcdata() in mymodule1.py and see if it changes here.
3) You can share data between Jupyter notebooks using the %store command.
Source: https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
And: Share data between IPython Notebooks
# %store example, first Jupyter notebook.
from sklearn import datasets
dataset = datasets.load_iris()
%store dataset
# from a new Jupyter notebook read.
%store -r dataset

How to use the Pandas 'sep' command in Google Colab?

So, I used Jupyter Notebook, and there using the sep parameter was pretty simple. But now I'm slowly migrating to Google Colab, and while I can find the file and build the DataFrame with pd.read_csv(), I can't seem to separate the columns with sep=!
I mounted the Drive and located the file:
import pandas as pd
from google.colab import drive
drive.mount('/content/gdrive')
with open('/content/gdrive/My Drive/wordpress/cousins.csv', 'r') as f:
    f.read()
Then I built the Dataframe:
df = pd.read_csv('/content/gdrive/My Drive/wordpress/cousins.csv',sep=";")
The dataframe is built, but it is not separated into columns! Below is a screenshot:
[screenshot: built DataFrame]
Last edit: Turns out the problem was with the data I was trying to use, because it also didn't work in Jupyter. There is no problem with sep the way it was being used!
PS: I also tried sep='.' and sep=',' to see if it works, and nothing changed.
I downloaded the data as a CSV table from Football-Reference, pasted it into Excel, and saved it as CSV (UTF-8); an example of the file can be found here:
Pastebin Example File
This works for me:
My data:
a,b,c
5,6,7
8,9,10
You don't need sep for a comma-separated file.
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
# suppose I have data in my Google Drive in the file path
# GoogleColaboratory/data/so/a.csv
# The folder GoogleColaboratory is in my Google Drive.
df = pd.read_csv('drive/My Drive/GoogleColaboratory/data/so/a.csv')
df.head()
Instead of
df = pd.read_csv('/content/gdrive/My Drive/wordpress/cousins.csv', sep=";")
Use
df = pd.read_csv('/content/gdrive/My Drive/wordpress/cousins.csv', delimiter=";")
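Note that in pandas delimiter is just an alias for sep, so if sep=';' didn't work, delimiter=';' generally won't behave differently. If you aren't sure which delimiter the file really uses, a sketch (same hypothetical path as above) that lets pandas sniff it:
import pandas as pd

# sep=None asks pandas to infer the delimiter via csv.Sniffer;
# this requires the slower 'python' parsing engine.
df = pd.read_csv('/content/gdrive/My Drive/wordpress/cousins.csv', sep=None, engine='python')
df.head()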

Load xlsx file from drive in colaboratory

How can I import an MS Excel (.xlsx) file from Google Drive into Colaboratory?
excel_file = drive.CreateFile({'id':'some id'})
does work (drive is a pydrive.drive.GoogleDrive object). But
print excel_file.FetchContent()
returns None. And
excel_file.content()
throws:
TypeError Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 excel_file.content()
TypeError: '_io.BytesIO' object is not callable
My intent is (given some valid file 'id') to import it as an io object, which could be read by pandas read_excel(), and finally get a pandas dataframe out of it.
You'll want to use excel_file.GetContentFile to save the file locally. Then, you can use the Pandas read_excel method after you !pip install -q xlrd.
Here's a full example:
https://colab.research.google.com/notebook#fileId=1SU176zTQvhflodEzuiacNrzxFQ6fWeWC
What I did in more detail:
I created a new spreadsheet in Sheets, exported it as an .xlsx file, and uploaded it again to Drive. The URL is:
https://drive.google.com/open?id=1Sv4ib5i7CKWhAHZkKg-uitIkS3xwxtXM
Note the file ID. In my case it's 1Sv4ib5i7CKWhAHZkKg-uitIkS3xwxtXM.
Then, in Colab, I tweaked the Drive download snippet to download the file. The key bits are:
file_id = '1Sv4ib5i7CKWhAHZkKg-uitIkS3xwxtXM'
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('exported.xlsx')
Finally, to create a Pandas DataFrame:
!pip install -q xlrd
import pandas as pd
df = pd.read_excel('exported.xlsx')
df
The !pip install... line installs the xlrd library, which is needed to read Excel files.
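A hindsight note, not from the original answer: newer xlrd releases (2.0+) dropped .xlsx support, so on a current pandas you would typically install openpyxl instead:
!pip install -q openpyxl
import pandas as pd

# pandas selects openpyxl automatically for .xlsx files; passing
# engine explicitly just makes the choice visible.
df = pd.read_excel('exported.xlsx', engine='openpyxl')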
Perhaps a simpler method:
#To read/write data from Google Drive:
#Reference: https://colab.research.google.com/notebooks/io.ipynb#scrollTo=u22w3BFiOveA
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')
df = pd.read_excel('/content/drive/My Drive/folder_name/file_name.xlsx')
# #When done,
# drive.flush_and_unmount()
# print('All changes made in this colab session should now be visible in Drive.')
First, I import io, pandas and files from google.colab
import io
import pandas as pd
from google.colab import files
Then I upload the file using an upload widget
uploaded = files.upload()
You will see something similar to this (click on Choose Files and upload the xlsx file):
Let's suppose that the name of the file is my_spreadsheet.xlsx; you need to use it in the following line:
df = pd.read_excel(io.BytesIO(uploaded.get('my_spreadsheet.xlsx')))
And that's all; now you have the first sheet in the df dataframe. However, if you have multiple sheets, you can change the code as follows:
First, move the io call to another variable
xlsx_file = io.BytesIO(uploaded.get('my_spreadsheet.xlsx'))
And then, use the new variable to specify the sheet name, like this:
df_first_sheet = pd.read_excel(xlsx_file, 'My First Sheet')
df_second_sheet = pd.read_excel(xlsx_file, 'My Second Sheet')
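Relatedly, if you want every sheet at once, pandas can return a dict of DataFrames keyed by sheet name (a sketch reusing the same hypothetical upload):
import io
import pandas as pd

xlsx_file = io.BytesIO(uploaded.get('my_spreadsheet.xlsx'))

# sheet_name=None loads all sheets into a {sheet_name: DataFrame} dict.
all_sheets = pd.read_excel(xlsx_file, sheet_name=None)
df_first_sheet = all_sheets['My First Sheet']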
import pandas as pd
xlsx_link = 'https://docs.google.com/spreadsheets/d/1Sv4ib5i7CKWhAHZkKg-uitIkS3xwxtXM/export'
df = pd.read_excel(xlsx_link)
If the xlsx is hosted on Google Drive, then once it is shared, anyone can use the link to access it, with or without a Google account. The google.colab.drive and google.colab.files dependencies are not necessary.
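For a native Google Sheet (rather than an uploaded .xlsx file), the export link can usually be built from the sheet ID; the ID below is just the one from this thread, reused for illustration:
import pandas as pd

sheet_id = '1Sv4ib5i7CKWhAHZkKg-uitIkS3xwxtXM'
xlsx_link = f'https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=xlsx'
df = pd.read_excel(xlsx_link)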
Easiest way I found so far.
Pretty similar to what we do on the desktop.
Considering you uploaded the file to your Google Drive folder:
On the left bar, click on Files (below the {x}).
Select Mount Drive > drive > folder > file (left-click and Copy Path).
After that, just go to the code and paste the path:
pd.read_excel('/content/drive/MyDrive/Colab Notebooks/token_rating.xlsx')
