IPython notebook read string from raw text cell - python

I have a raw text cell in my IPython notebook project.
Is there a way to get the text as a string with a built-in function or something similar?

My (possibly unsatisfactory) answer is in two parts. This is based on a personal investigation of IPython structures, and it's entirely possible I've missed something that more directly answers the question.
Current session
The raw text for code cells entered during the current session is available within a notebook using the list In.
So the raw text of the current cell can be returned by the following expression within the cell:
In[len(In)-1]
For example, evaluating a cell containing this code:
print "hello world"
three = 1+2
In[len(In)-1]
yields this corresponding Out[] value:
u'print "hello world"\nthree = 1+2\nIn[len(In)-1]'
So, within an active notebook session, you can access the raw text of a cell as In[n], where n is the displayed index of the required cell.
But if the cell was entered during a previous notebook session that has since been closed and reopened, that no longer works. Also, only code cells appear to be included in the In list, so this approach cannot return the contents of a raw text cell.
Cells from saved notebook sessions
In my research, the only way I could uncover to get the raw text from previous sessions was to read the original notebook file. There is a documentation page Importing IPython Notebooks as Modules describing how to do this. The key code is in In[4]:
# load the notebook object
with io.open(path, 'r', encoding='utf-8') as f:
    nb = current.read(f, 'json')
where current is an instance of the API described at Module: nbformat.current.
The notebook object returned is accessed as a nested dictionary and list structure, e.g.:
for cell in nb.worksheets[0].cells:
    ...
The cell objects thus enumerated have two key fields for the purpose of this question:
cell.cell_type is the type of the cell ("code", "markdown", "raw", etc.).
cell.input is the raw text content of the cell as a list of strings, with an entry for each line of text.
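Putting those pieces together, here is a minimal sketch that pulls the text out of every raw cell in a saved notebook using the same legacy API (the fallback from input to source is my assumption: code cells keep their text in input, while other cell types use source in this format):

import io
from IPython.nbformat import current

def raw_cell_texts(path):
    # load the saved notebook file (legacy v3-era API, as above)
    with io.open(path, 'r', encoding='utf-8') as f:
        nb = current.read(f, 'json')
    texts = []
    for cell in nb.worksheets[0].cells:
        if cell.cell_type == 'raw':
            # code cells keep their text in 'input'; other cell types use 'source'
            text = getattr(cell, 'input', None) or getattr(cell, 'source', '')
            if isinstance(text, list):   # the text may be a list of lines
                text = ''.join(text)
            texts.append(text)
    return texts

print(raw_cell_texts('my_notebook.ipynb'))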
Much of this can be seen by looking at the JSON data that constitutes a saved IPython notebook.
Apart from the "prompt number" fields in a notebook, which seem to change whenever the field is re-evaluated, I could find no way to create a stable reference to a notebook cell.
Conclusion
I couldn't find an easy answer to the original question. What I found is covered above. Without knowing the motivation behind the original question, I can't know if it's enough.
What I looked for, but was unable to identify, was a way to reference the current notebook that can be used from within the notebook itself (e.g. via a function like get_ipython()). That doesn't mean it doesn't exist.
The other missing piece in my response is any kind of stable way to refer to a specific cell. (Looking at the notebook file format, raw text cells consist solely of a cell type ("raw") and the raw text itself, though it appears that cell metadata might also be included.) This suggests the only way to directly reference a cell is through its position in the notebook, but that is subject to change when the notebook is edited.
(Researched and answered as part of the Oxford participation in http://aaronswartzhackathon.org)

I am not allowed to comment due to my lack of reputation, so I will just post as an answer an update to Graham Klyne's answer, in case someone else stumbles onto this. (IPython has no updated documentation on this to date.)
Use nbformat instead of IPython.nbformat.current.
The worksheets attribute is gone, so use cells directly.
Here is an example of what the updated code looks like:
https://github.com/ldiary/marigoso/blob/master/marigoso/NotebookImport.py
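For reference, a minimal sketch of the same idea against the standalone nbformat package (this reflects the v4 format, where every cell, including a raw cell, keeps its text in source):

import nbformat

# read the notebook, upgrading it to the v4 in-memory representation
nb = nbformat.read('my_notebook.ipynb', as_version=4)

# no worksheets layer any more: cells hang directly off the notebook node
for cell in nb.cells:
    if cell.cell_type == 'raw':
        print(cell.source)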

Related

Jupyter Notebook issue

I ran some commands in a Jupyter Notebook and expected a printed output containing the data from a .csv file in tabulated form, but instead I get an incomplete output.
This is the result I get from the .csv file:
I ran this command:
import pandas

df1 = pandas.read_csv("supermarkets.csv", on_bad_lines='skip')
df1
I expected to get a printed output in tabulated form, like in the image attached.
The data gets printed in a well-tabulated form here.
Here is a link to the online version of the file:
[pythonhow.com/supermarkets.csv]
Getting good, clean data where the file extension correctly matches the actual content is often a challenge. Assessing the state of the input data is always a very important first step.
It appears the data you are trying to get is also online here. GitHub will render that as a table in the browser because it has a viewer mode. To look at the 'raw' file content, click here. You'll see it is a nice comma-delimited file, with columns separated by commas and each row on its own line. The header with the column names is on the first line.
Now open the file you are working with in a good text editor and compare it to the content I pointed you at. That should show you what the issue is.
At this point you may just wish to switch to using the version of the file that I pointed you at.
Use the link below to obtain it as a proper csv file:
https://raw.githubusercontent.com/kenvilar/data-analysis-using-python/master/supermarkets.csv
You should be able to paste that link into your browser, then right-click on the page and choose 'Save as...' to download it to your local machine. The downloaded file should open just fine with the code you showed in the screenshot in your post here.
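Alternatively, pandas can read the raw file straight from that URL, which skips the manual download step; a minimal sketch, assuming the machine running the notebook has internet access:

import pandas

url = ("https://raw.githubusercontent.com/kenvilar/"
       "data-analysis-using-python/master/supermarkets.csv")
df1 = pandas.read_csv(url)
df1.head()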
Please work on writing better questions with specific titles; see here for guidance. The title at present is overly broad and actually not accurate: this code would not work with the data you apparently have even if you were running it in a plain Python script, so it is not a Jupyter notebook issue. When thinking about how to make a title specific, it helps to write for your future self. If you continue to use notebooks you'll have hundreds that could be called a 'Jupyter Notebook issue'; what makes this one different from those?
I believe there is an issue with your csv file, not the code.
To me it looks like the data in your csv file is written in JSON format.
Have you opened the supermarkets.csv file in Excel? It should look like a table, not a JSON-formatted file.
Did you try df1.show() to see if the csv got read in the first place?

Unable to contain text in tabulate cell when changing table format

I'm using tabulate to create some nice tables for a lab, but I'm running into an issue. I have a "comments" section where the tables include, you guessed it, comments. The only thing is, when I apply tablefmt="fancy_grid", the text wraps all the way across the table and doesn't stay contained in its cell. Even without "fancy_grid", I had to limit the word count because the text would exceed the cell width and break the table. I'll include a picture below.
My question would be: is there a way to apply a setting that contains the text and makes it wrap within that cell, rather than across the entire table width? Nothing on the PyPI tabulate page shows a solution to this; you can format columns, but I didn't see anything related to this specific issue. I know about using p{width} in LaTeX, but I decided to go with tabulate from the start instead of LaTeX, as I'm not a big fan of the notation.
Regular table with contained cell content: [image of a properly structured table]
Broken table due to text overflow: [image of an improperly structured table]
Cheers.
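One possible workaround (not something the tabulate documentation prescribes) is to pre-wrap long cell text with the standard-library textwrap module before handing the rows to tabulate, since tabulate renders embedded newlines as multi-line cells. A minimal sketch, where the 30-character width is an arbitrary choice:

import textwrap
from tabulate import tabulate

rows = [
    ["Trial 1", "Short comment"],
    ["Trial 2", "A much longer comment that would normally run past the edge "
                "of the cell and break the fancy_grid layout"],
]

# wrap every cell to a fixed width so the line breaks stay inside the cell
wrapped = [[textwrap.fill(str(cell), width=30) for cell in row] for row in rows]

print(tabulate(wrapped, headers=["Run", "Comments"], tablefmt="fancy_grid"))

Recent tabulate releases also accept a maxcolwidths argument that does this wrapping for you, if your installed version is new enough to have it.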

Adding new text into excel cell with another format using python & xlsxwriter

Hello, I've recently embarked on a project that allows me to input some data into a Python programme using Tkinter. This is the programme interface.
With this input, after clicking "Go" it will open an Excel spreadsheet and write the start and end time for that specific date. My question is: how do I write code that gives the NEW text a different colour, without altering what was in the cell originally, using xlsxwriter? Here's an example.
This is the original text/format for, say, the 5th May cell in my Excel sheet.
And after clicking Go, I hope to achieve this:
I'm OK with the code for opening Excel, writing, finding the cell, and saving.
I hope this is a clear enough question and hopefully it's an answer I can use!
Thanks!!
I think the GetCharacters function, used on a Range via win32com (pywin32), will do what you want.
ws.Range(cell/range as string).GetCharacters(start,end).Font.Color = [color ID]
After opening the workbook, I was able to do this to make characters 2-5 as Red:
ws.Range('A1').GetCharacters(2,5).Font.Color = -16776961
I got a lot of this from a previous question looking at bolding: How Do I Bold only part of a string in an excel cell with python
To get the color (and there is probably a better way), I went into Excel, used the macro recorder, changed the font to red, and saw what the recorded macro called that color. You can get the number IDs for the colors you want the same way.
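As a rough end-to-end sketch of that approach (the workbook path, sheet name, and appended text are placeholders, and the color value is simply the one the macro recorder reported for red; note that the Excel object model documents the two Characters parameters as a start position and a length):

import win32com.client

excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = False
wb = excel.Workbooks.Open(r"C:\path\to\schedule.xlsx")   # placeholder path
ws = wb.Worksheets("Sheet1")                             # placeholder sheet name

cell = ws.Range("A1")
new_text = " 0900-1700"                                  # placeholder new text

# append the new text after whatever is already in the cell
old_len = len(cell.Value or "")
cell.Value = (cell.Value or "") + new_text

# color only the newly added characters (positions are 1-based)
cell.GetCharacters(old_len + 1, len(new_text)).Font.Color = -16776961  # red

wb.Save()
excel.Quit()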

Identify the edited location in the PDF modified by online editor www.ilovepdf.com using Python

I have an SBI bank statement PDF which is tampered/forged. Here is the link for the PDF.
This PDF was edited using the online editor www.ilovepdf.com. The edited part is the first entry under the 'Credit' column. The original entry was '2,412.00' and I modified it to '12.00'.
Is there any programmatic way, using Python or any other open-source technology, to identify the edited/modified location/area of the PDF (i.e. a bounding box around the 12.00 credit entry in this PDF)?
Two things I already know:
Metadata (Info or XMP metadata) is not useful. The modify date in the metadata doesn't tell you whether the PDF was merely recompressed or actually edited, since it changes in both cases. It also doesn't give the location of the edit.
The PyMuPDF spans JSON object is also not useful, as the edited entry doesn't come at the end of the spans JSON; instead it sits in the proper order of the text inside the PDF. Here is the span JSON file generated from PyMuPDF.
Kindly let me know if anyone has any opensource solution to resolve this problem.
iLovePDF completely rewrites the text in the document. You can even see this: just open the original and the manipulated PDFs in two Acrobat Reader tabs and switch back and forth between them, and you'll see nearly all the letters move a bit.
Internally, iLovePDF also rewrote the PDF completely according to its own preferences, and the edit fits in perfectly.
Thus, no, you cannot recognize the manipulated text based on this document alone, because technically it is a completely different, completely new document.
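If you have both the original and the edited file, a minimal sketch of how that wholesale reflow shows up programmatically with PyMuPDF (the file names are placeholders, and pairing words up with zip assumes the two extractions stay roughly aligned):

import fitz  # PyMuPDF

def first_page_words(path):
    with fitz.open(path) as doc:
        # each entry: (x0, y0, x1, y1, word, block_no, line_no, word_no)
        return doc[0].get_text("words")

original = first_page_words("statement_original.pdf")
edited = first_page_words("statement_edited.pdf")

# report words whose bounding box shifted between the two files
for a, b in zip(original, edited):
    if a[4] == b[4] and any(abs(a[i] - b[i]) > 0.5 for i in range(4)):
        print(f"{a[4]!r} moved from {tuple(round(v, 1) for v in a[:4])} "
              f"to {tuple(round(v, 1) for v in b[:4])}")

On a pair of files processed like this, nearly every word tends to report a small shift, which is exactly why the single edit cannot be isolated.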

Why does Python str.len() see a certain cell in my Pandas object as 6889 characters, but SQL Server LEN and Notepad++ show 8002?

I'm using Python to parse a large XML file that we acquired from a 3rd-party vendor, and I have a predetermined destination table whose column is varchar(8000). I have a logical check within my Jupyter notebook: if the length of a Pandas cell is over 8000, I split it; otherwise I just let it pass through.
Now, when I try to do a bulk insert with the CSV output from the Jupyter notebook, I'm getting truncation errors. When I looked into it, for a specific row df["column"].str.len() returns a length of 6889, but the exact same string, copied and pasted into SSMS and Notepad++, comes out as 8002 characters, which of course is more than the varchar(8000) field size.
I'm trying to find specific parameters to str.len() in Python/Pandas but can't seem to find any. Can anyone please explain why?
Thank you and more power!
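Not an explanation of this particular 6889-versus-8002 mismatch, but one diagnostic worth running: str.len() counts Unicode characters, so comparing it with the encoded byte length and with the number of carriage returns can show whether multi-byte characters or line endings account for the difference. A small self-contained sketch:

import pandas as pd

# toy example: the same visible text once with \n and once with \r\n line endings
df = pd.DataFrame({"column": ["line one\nline two", "line one\r\nline two"]})

print(df["column"].str.len())                      # characters as pandas counts them
print(df["column"].str.encode("utf-8").str.len())  # bytes after encoding
print(df["column"].str.count("\r"))                # carriage returns per value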
