Display interactive table - python

Is it possible in Jupyter to display tabular data in some interactive format?
So that, for example, the following data
A,x,y
a,0,0
b,5,2
c,5,3
d,0,3
will be scrollable and sortable by the A, x, and y columns?

Yes, it is possible.
First, install itables:
!pip install itables
Next, import the module and turn on interactive mode:
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
Let's load your data into a pandas dataframe:
import pandas as pd

data = {
    'A': ['a', 'b', 'c', 'd'],
    'x': [0, 5, 5, 0],
    'y': [0, 2, 3, 3],
}
df = pd.DataFrame(data)
df
Displaying df now renders an interactive table that you can scroll and sort by any of the columns.
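If you would rather keep pandas' default rendering and make only selected tables interactive, itables also exposes a show function; a minimal sketch, assuming a recent itables version:
from itables import show

show(df)  # render just this dataframe as an interactive table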

Related

qgrid not showing output Python

I am trying to run the code below. It executes successfully but does not show any output, and I am totally clueless as to why this is happening.
import pandas as pd
import qgrid
df = pd.DataFrame({'A': [1.2, 'foo', 4], 'B': [3, 4, 5]})
df = df.set_index(pd.Index(['bar', 7, 3.2]))
view = qgrid.show_grid(df, grid_options={'fullWidthRows': True}, show_toolbar=True)
view
Also see my attached screenshot.
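A likely cause, assuming a classic Jupyter Notebook setup (the question shows no error, so this is a guess): the qgrid widget extension is not enabled, in which case show_grid silently renders nothing. The qgrid documentation suggests enabling the extensions and restarting the notebook:
# enable the ipywidgets and qgrid notebook extensions, then restart Jupyter
!jupyter nbextension enable --py --sys-prefix widgetsnbextension
!jupyter nbextension enable --py --sys-prefix qgrid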

How to colorize all rows of a dataframe based on values of a column dynamically?

Let's say I have a dataframe like the following:
import pandas as pd

df = pd.DataFrame({'Sample': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'NFW': [8.16, 8.63, 9.25, 8.97, 7.5, 8.21],
                   'Qubit': [55, 100, 229, 30, 42, 33],
                   'Lane': ['1', '1', '2', '2', '3', '3']})
I want to colorize the background of all rows based on the values of the Lane column dynamically. I'll also write this dataframe to an Excel file, and I need to keep all the style changes there too.
IIUC, you can use a colormap (for example from matplotlib) and map it to each row using a Categorical and style.apply:
import matplotlib.cm
import matplotlib.colors
import numpy as np

# note: on matplotlib >= 3.9 use matplotlib.colormaps['Pastel1'] instead
cmap = matplotlib.cm.get_cmap('Pastel1')
colors = [matplotlib.colors.rgb2hex(cmap(i)) for i in range(cmap.N)]
# ['#fbb4ae', '#b3cde3', '#ccebc5', '#decbe4', '#fed9a6', '#ffffcc', '#e5d8bd', '#fddaec', '#f2f2f2']

def color(df, colors=colors):
    # get unique values as category
    s = df['Lane'].astype('category')
    # map the colors and create a CSS string per row
    s = ('background-color: '
         + s.cat.rename_categories(colors[:len(s.cat.categories)])
            .astype(str)
         )
    # expand to the DataFrame's shape
    return pd.DataFrame(np.tile(s, (df.shape[1], 1)).T,
                        index=df.index, columns=df.columns)

df.style.apply(color, axis=None)
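Since the styling also needs to survive the trip to Excel, the Styler itself can be written out; a minimal sketch, assuming openpyxl is installed (the file name is just an example):
# export the styled dataframe; the background colors are kept in the xlsx file
df.style.apply(color, axis=None).to_excel('styled_output.xlsx', engine='openpyxl')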

Comparing date columns between two dataframes

The project I'm working on requires me to find out which 'project' has been updated since the last time it was processed. For this purpose I have two dataframes which both contain three columns, the last one of which is a date signifying the last time a project is updated. The first dataframe is derived from a query on a database table which records the date a 'project' is updated. The second is metadata I store myself in a different table about the last time my part of the application processed a project.
I think I've come pretty far, but I'm stuck on the following error; see the code provided below:
import pandas as pd

lastmatch = pd.DataFrame({
    'projectid': ['1', '2', '2', '3'],
    'stage': ['c', 'c', 'v', 'v'],
    'lastmatchdate': ['2020-08-31', '2013-11-24', '2013-11-24',
                      '2020-08-31']
})
lastmatch['lastmatchdate'] = pd.to_datetime(lastmatch['lastmatchdate'])

processed = pd.DataFrame({
    'projectid': ['1', '2'],
    'stage': ['c', 'v'],
    'process_date': ['2020-08-30', '2013-11-24']
})
processed['process_date'] = pd.to_datetime(processed['process_date'])

unprocessed = lastmatch[~lastmatch.isin(processed)].dropna()

processed.set_index(['projectid', 'stage'], inplace=True)
lastmatch.set_index(['projectid', 'stage'], inplace=True)
processed.sort_index(inplace=True)
lastmatch.sort_index(inplace=True)

print(lastmatch['lastmatchdate'])
print(processed['process_date'])

to_process = lastmatch.loc[lastmatch['lastmatchdate'] > processed['process_date']]
The result I want to achieve is a dataframe containing the rows where the 'lastmatchdate' is greater than the date that the project was last processed (process_date). However this line:
to_process = lastmatch.loc[lastmatch['lastmatchdate'] > processed['process_date']]
produces ValueError: Can only compare identically-labeled Series objects. I think there might be some syntax I don't know about or got wrong.
The output I expect in this case is:
                lastmatchdate
projectid stage
1         c        2020-08-31
So concretely, the question is: how do I get a dataframe containing only the rows of one dataframe where the (datetime) value of column a is greater than the value of column b in the other dataframe?
Merge the two indexed dataframes so you can compare the columns inside a single frame:
merged = pd.merge(processed, lastmatch, left_index=True, right_index=True)
merged = merged.assign(to_process=merged['lastmatchdate'] > merged['process_date'])
You will get the following:
                process_date lastmatchdate  to_process
projectid stage
1         c       2020-08-30    2020-08-31        True
2         v       2013-11-24    2013-11-24       False
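To keep only the rows that still need processing, filter on that boolean flag:
to_process_df = merged[merged['to_process']]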
You received the ValueError because you tried to compare Series from two differently-labeled dataframes. If you want to compare two dataframes row by row, merge them first:
lastmatch = pd.DataFrame({
    'projectid': ['1', '2', '2', '3'],
    'stage': ['c', 'c', 'v', 'v'],
    'lastmatchdate': ['2020-08-31', '2013-11-24', '2013-11-24',
                      '2020-08-31']
})
lastmatch['lastmatchdate'] = pd.to_datetime(lastmatch['lastmatchdate'])

processed = pd.DataFrame({
    'projectid': ['1', '2'],
    'stage': ['c', 'v'],
    'process_date': ['2020-08-30', '2013-11-24']
})
processed['process_date'] = pd.to_datetime(processed['process_date'])

df = pd.merge(lastmatch, processed, on=['stage', 'projectid'])
df = df[df.lastmatchdate > df.process_date]
print(df)
  projectid stage lastmatchdate process_date
0         1     c    2020-08-31   2020-08-30

Unable to train or test data

I am using the code from this notebook: https://github.com/venky14/Machine-Learning-with-Iris-Dataset/blob/master/Machine%20Learning%20with%20Iris%20Dataset.ipynb
After splitting the data into training and testing sets, I am unable to take the features for the training and testing data. The error is thrown at In[92]:
KeyError: "['A' 'B' 'C' 'D' 'E' 'F' 'H' 'I'] not in index"
Below is an image of what my CSV file looks like.
It seems that you are selecting column names that are not present in the dataframe. Please provide sample code, because the referenced ipynb seems to be correct.
You are probably looking for this:
import pandas as pd

df = pd.read_csv('sample-table.csv')
# select only the feature columns, then take the underlying numpy array
df_selected_columns = df[['A', 'B', 'C', 'D', 'E', 'F', 'H', 'I']]
np_ndarray = df_selected_columns.values
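If the KeyError persists, a common culprit (an assumption here, since the CSV is only shown as an image) is stray whitespace in the header row; printing and stripping the parsed column names rules that out:
# show the exact column names pandas parsed from the CSV
print(df.columns.tolist())
# remove accidental leading/trailing whitespace from the headers
df.columns = df.columns.str.strip()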

Tab-delimited Python 3 .txt file reading

I am having a bit of trouble getting started on an assignment. We are given a tab-delimited .txt file with 6 columns of data and around 50 lines of it. I need help starting a list to store this data in for later recall. Eventually I will need to be able to list all the contents of any particular column and sort it, count it, etc. Any help would be appreciated.
Edit: I really haven't done much besides research on this kind of stuff. I know I'll be looking into csv, and I have done single-column .txt files before, but I'm not sure how to tackle this situation. How will I give names to the separate columns? How will I tell the program where one row ends and the next begins?
The dataframe structure in pandas does basically exactly what you want. It's highly analogous to the data frame in R, if you're familiar with that, and it has built-in options for subsetting, sorting, and otherwise manipulating tabular data.
It reads directly from csv and even reads in column names automatically. You'd call:
import pandas as pd

df = pd.read_csv(yourfilename,
                 sep='\t',  # makes it tab-delimited
                 header=0)  # makes the first row the header row
This works in Python 3.
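Once the file is loaded, the column operations the assignment asks for are one-liners; a small sketch, assuming one of the columns is named 'col1' (the real names come from your file's header row):
print(df['col1'].tolist())        # list all the contents of a column
print(df['col1'].sort_values())   # sort it
print(df['col1'].value_counts())  # count occurrences of each value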
Let's say you have a csv like the following.
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
You can read them into a dictionary like so:
>>> import csv
>>> reader = csv.DictReader(open('test.csv', 'r'), fieldnames=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'], dialect='excel-tab')
>>> for row in reader:
...     print(row)
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
{'col1': '1', 'col2': '2', 'col3': '3', 'col4': '4', 'col5': '5', 'col6': '6'}
But the pandas library might be better suited for this: http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files
Sounds like a job better suited to a database. You could use something like PostgreSQL's COPY FROM operation to import the CSV data into a table, then use Python + SQL for all your sorting, searching, and matching needs.
If you feel a real database is overkill, there are still options like SQLite and BerkeleyDB, which both have Python modules.
EDIT: BerkeleyDB is deprecated, but anydbm is similar in concept.
I think using a db for 50 lines and 6 columns is overkill, so here's my idea:
from __future__ import print_function
import os
from operator import itemgetter

def get_records_from_file(path_to_file):
    """
    Read a tab-delimited file and return a
    list of dictionaries representing the data.
    """
    records = []
    with open(path_to_file, 'r') as f:
        # Use the first line to get names for columns
        fields = [e.lower() for e in f.readline().strip().split('\t')]
        # Iterate over the rest of the lines and store records
        for line in f:
            record = {}
            for i, field in enumerate(line.strip().split('\t')):
                record[fields[i]] = field
            records.append(record)
    return records

if __name__ == '__main__':
    path = os.path.join(os.getcwd(), 'so.txt')
    records = get_records_from_file(path)
    print('Number of records: {0}'.format(len(records)))
    s = sorted(records, key=itemgetter('id'))
    print('Sorted: {0}'.format(s))
For storing records for later use, look into Python's pickle library--that'll allow you to preserve them as Python objects.
Also, note I don't have Python 3 installed on the computer I'm using right now, but I'm pretty sure this'll work on Python 2 or 3.
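As a quick sketch of the pickle suggestion above (the file name is just an example):
import pickle

# save the records for later use
with open('records.pkl', 'wb') as f:
    pickle.dump(records, f)

# ...and load them back in a later session
with open('records.pkl', 'rb') as f:
    records = pickle.load(f)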
