I want to build a data annotation interface. I read in an Excel file in which only a text column is relevant, so "df" could also be replaced by a list of texts. This is my code:
import streamlit as st
import pandas as pd
import numpy as np

st.title('Text Annotation')

df = pd.read_excel('mini_annotation.xlsx')

for i in range(len(df)):
    text = df.iloc[i]['Text']
    st.write(f"Text {i} out of {len(df)}")
    st.write("Please classify the following text:")
    st.write("")
    st.write(text)

    text_list = []
    label_list = []
    label = st.selectbox("Classification:", ["HOF", "NOT", "Not Sure"])

    if st.button("Submit"):
        text_list.append(text)
        label_list.append(label)

df_annotated = pd.DataFrame(columns=['Text', 'Label'])
df_annotated["Text"] = text_list
df_annotated["Label"] = label_list
df_annotated.to_csv("annotated_file.csv", sep=";")
The interface looks like this:
However, I want the interface to display just one text, e.g. the first text of my dataset. After the user has submitted their choice via the "Submit" button, the first text should disappear and the second text should be displayed. This process should continue until the last text of the dataset is reached.
How do I do this?
(I am aware of the error message; for that I just have to add a key to the selectbox, but I am not sure whether that is needed in the end.)
This task can be solved with the help of session state. You can read about it in the Streamlit docs on session state.
import streamlit as st
import pandas as pd
import numpy as np

st.title('Text Annotation')

df = pd.DataFrame(['Text 1', 'Text 2', 'Text 3', 'Text 4'], columns=['Text'])

if st.session_state.get('count') is None:
    st.session_state.count = 0
    st.session_state.label_list = []

with st.form("my_form"):
    st.write("Progress: ", st.session_state.count, "/", len(df))
    if st.session_state.count < len(df):
        st.write(df.iloc[st.session_state.count]['Text'])
        label = st.selectbox("Classification:", ["HOF", "NOT", "Not Sure"])
        submitted = st.form_submit_button("Submit")
        if submitted:
            st.session_state.label_list.append(label)
            st.session_state.count += 1
    else:
        st.write("All done!")
        submitted = st.form_submit_button("Submit")
        if submitted:
            st.write(st.session_state.label_list)
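If you also want to persist the annotations to a file, like the to_csv call in your original code, the "All done!" branch above could write the collected labels out on the final submit. A minimal sketch of that idea, assuming the label list lines up one-to-one with the rows of df (this is a drop-in replacement for the else branch above, not tested against your data):

    else:
        st.write("All done!")
        submitted = st.form_submit_button("Submit")
        if submitted:
            # Pair each text with the label collected for it and save,
            # mirroring the annotated_file.csv output from the question.
            df_annotated = pd.DataFrame({
                'Text': df['Text'].tolist(),
                'Label': st.session_state.label_list,
            })
            df_annotated.to_csv("annotated_file.csv", sep=";", index=False)
            st.write(st.session_state.label_list)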
I have created a Dash app that reads data from a .csv file and plots it, where the user can choose which variable to display.
The problem I'm facing is that the Dash app keeps freezing or is very slow, most likely due to the sheer amount of data I'm reading (the .csv files I need to read have over 2 million lines).
Is there any way I can make it faster? Maybe by optimizing my code in some way?
Any help is appreciated, thanks in advance.
import pandas as pd
import numpy as np
from matplotlib import lines
import plotly.express as px
from dash import Dash, html, dcc, Input, Output
from tkinter import Tk
from tkinter.filedialog import askopenfilename
import webbrowser

print("Checkpoint 1")

def open_file():
    global df, drop_list
    Tk().withdraw()  # we don't want a full GUI, so keep the root window from appearing
    filename = askopenfilename()  # show an "Open" dialog box and return the path to the selected file
    print(filename)
    newfilename = filename.replace('/', '\\\\')
    print(newfilename)
    df = pd.read_csv('' + newfilename, sep=";", skiprows=4, skipfooter=2, engine='python')  # Read csv file using pandas
    # Detect all the different signals in the csv
    signals = df["Prozesstext"].unique()
    signals = pd.DataFrame(signals)  # dataframe creation
    signals.sort_values(by=0)  # after the dataframe is created it can be sorted
    drop_list = []  # list used for the dropdown menu
    for each in signals[0]:
        drop_list.append(each)

app = Dash(__name__)
fig = px.line([])  # figure starts with an empty chart

open_file()
print("Checkpoint 2")

app.layout = html.Div([
    html.H1(id='H1', children='Reading Data from CSV',
            style={'textAlign': 'center', 'marginTop': 40, 'marginBottom': 40}),
    dcc.Dropdown(drop_list[:-1], id='selection_box'),
    html.Div(id='dd-output-container'),
    dcc.Graph(
        id='trend1',
        figure=fig
    )
])

webbrowser.open("http://127.0.0.1:8050", new=2, autoraise=True)

# FIRST CALLBACK
@app.callback(
    Output(component_id='trend1', component_property='figure'),
    Input('selection_box', 'value'),
    prevent_initial_call=True
)
def update_trend1(value):
    df2 = df[df['Prozesstext'].isin([value])]  # without empty spaces it can be just df.column_name
    return px.line(df2, x="Zeitstempel", y="Daten", title=value, markers=True)  # line chart

if __name__ == '__main__':
    app.run_server()
    # app.run_server(debug=True)
I suggest the following:
1. Build a multi-value dropdown menu containing the names of all the columns in the CSV file (see the Dash Dropdown documentation).
2. Based on the columns the user selects, import only the corresponding data from the CSV file.
It is not clear from your question how you currently render the CSV data in Dash; I recommend dbc.Table.
By doing this you minimize the cost of reading the entire CSV file, as shown in the sketch below.
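A minimal sketch of that idea, assuming the file path is known up front (CSV_PATH is a hypothetical placeholder) and using pandas' usecols so that only the selected columns are parsed; the dbc.Table rendering is illustrative, not your exact setup:

import pandas as pd
import dash_bootstrap_components as dbc
from dash import Dash, dcc, html, Input, Output

CSV_PATH = "data.csv"  # hypothetical path; replace with your file

# Read only the header row once to get the column names for the dropdown
all_columns = pd.read_csv(CSV_PATH, sep=";", nrows=0).columns.tolist()

app = Dash(__name__, external_stylesheets=[dbc.themes.BOOTSTRAP])
app.layout = html.Div([
    dcc.Dropdown(all_columns, id="column_selection", multi=True),
    html.Div(id="table_container"),
])

@app.callback(
    Output("table_container", "children"),
    Input("column_selection", "value"),
    prevent_initial_call=True,
)
def show_selected_columns(selected):
    if not selected:
        return html.P("No columns selected.")
    # usecols makes pandas parse only the selected columns of the CSV
    df = pd.read_csv(CSV_PATH, sep=";", usecols=selected)
    # Hand only a slice of rows to the browser; a 2-million-row table would still freeze it
    return dbc.Table.from_dataframe(df.head(100), striped=True, bordered=True)

if __name__ == "__main__":
    app.run_server()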
I am trying to find a solution to the following issue.
I would like to upload an Excel file consisting of multiple sheets (two in this use case). I then added tabs via Streamlit and used the AgGrid component so that some cells can be edited. However, if I change cells in tab 1, jump to tab 2 and back, the changes are gone. This is not the desired behaviour: any changes made to a cell should remain.
I tried st.cache and st.experimental_memo, but without success.
My code is below:
import numpy as np
import streamlit as st
import pandas as pd
from st_aggrid import GridOptionsBuilder, AgGrid, GridUpdateMode, DataReturnMode, JsCode

excelfile = st.sidebar.file_uploader("Select Excel-File for cleansing", key="Raw_Data")
if excelfile == None:
    st.balloons()

tab1, tab2 = st.tabs(["Sheet 1", "Sheet 2"])

#st.cache()
def load_sheet1():
    sheet1 = pd.read_excel(excelfile, sheet_name="Sheet1")
    return sheet1

#st.cache()
def load_sheet2():
    sheet1 = pd.read_excel(excelfile, sheet_name="Sheet2")
    return sheet1

df = load_sheet1()
with tab1:
    gd = GridOptionsBuilder.from_dataframe(df)
    gd.configure_pagination(enabled=True)
    gd.configure_default_column(editable=True, groupable=True)
    gd.configure_selection(selection_mode="multiple", use_checkbox=True)
    gridoptions = gd.build()
    grid_table = AgGrid(
        df,
        gridOptions=gridoptions,
        update_mode=GridUpdateMode.SELECTION_CHANGED,
        theme="material",
    )

df1 = load_sheet2()
with tab2:
    gd = GridOptionsBuilder.from_dataframe(df1)
    gd.configure_pagination(enabled=True)
    gd.configure_default_column(editable=True, groupable=True)
    gd.configure_selection(selection_mode="multiple", use_checkbox=True)
    gridoptions = gd.build()
    grid_table = AgGrid(
        df1,
        gridOptions=gridoptions,
        update_mode=GridUpdateMode.SELECTION_CHANGED,
        theme="material",
    )
I can also share my test Excel file with you:

Sheet 1:
Col1  Col2
A     C
B     D

Sheet 2:
Col3  Col4
E     G
F     H
Any kind of support on how to eliminate this issue would be more than awesome.
EDIT: Here is a solution without the load button.
I couldn't find a way to do it without adding a button that reloads the page to apply the changes. Since Streamlit reruns the whole script every time you interact with it, it is a bit tricky to render elements the right way. Here is your code refactored. Hope this helps!
import streamlit as st
import pandas as pd
from st_aggrid import AgGrid, GridUpdateMode, GridOptionsBuilder

# Use session_state to keep a stack of changes
if 'df' not in st.session_state:
    st.session_state.df = pd.DataFrame()
if 'df1' not in st.session_state:
    st.session_state.df1 = pd.DataFrame()
if 'excelfile' not in st.session_state:
    st.session_state.excelfile = None

#st.cache()
def load_sheet1():
    sheet1 = pd.read_excel(excelfile, sheet_name="Sheet1")
    return sheet1

#st.cache()
def load_sheet2():
    sheet1 = pd.read_excel(excelfile, sheet_name="Sheet2")
    return sheet1

def show_table(data):
    if not data.empty:
        gd = GridOptionsBuilder.from_dataframe(data)
        gd.configure_pagination(enabled=True)
        gd.configure_default_column(editable=True, groupable=True)
        gd.configure_selection(selection_mode="multiple", use_checkbox=True)
        gridoptions = gd.build()
        grid_table = AgGrid(
            data,
            gridOptions=gridoptions,
            # Use MODEL_CHANGED instead of SELECTION_CHANGED
            update_mode=GridUpdateMode.MODEL_CHANGED,
            theme="material"
        )
        # Get the edited table when you make changes and return it
        edited_df = grid_table['data']
        return edited_df
    else:
        return pd.DataFrame()

excelfile = st.sidebar.file_uploader("Select Excel-File for cleansing", key="Raw_Data")
if st.session_state.excelfile != excelfile:
    st.session_state.excelfile = excelfile
    try:
        st.session_state.df = load_sheet1()
        st.session_state.df1 = load_sheet2()
    except:
        st.session_state.df = pd.DataFrame()
        st.session_state.df1 = pd.DataFrame()

tab1, tab2 = st.tabs(["Sheet 1", "Sheet 2"])
with tab1:
    # Get the edited DataFrame from the AgGrid object
    df = show_table(st.session_state.df)
with tab2:
    # Same thing here...
    df1 = show_table(st.session_state.df1)

# Then you need to click on a button to apply the changes and
# reload the page before you go to the next tab
if st.button('Apply changes'):
    # Store the newly edited DataFrames in session state
    st.session_state.df = df
    st.session_state.df1 = df1
    # Rerun the page so that the changes apply and the new DataFrames are rendered
    st.experimental_rerun()
After loading your file and making your changes in the first tab, hit the "Apply changes" button to reload the page before moving to the second tab.
When I run this script, I do not get any output. It appears to be successful, as I do not get any errors telling me otherwise. When I run the notebook, a cell appears below the fifth cell, indicating that the script ran successfully, but nothing is populated. All of my auth is correct: when I use the same auth in Postman to pull tag data values, it is successful. This script used to run fine and output a table in addition to a graph.
What gives? Any help would be greatly appreciated.
Sample dataset when pulling tag data values from the Azure API:

{
    "c": 100,
    "s": "opc",
    "t": "2021-06-11T16:45:55.04Z",
    "v": 80321248.5
}
# Code
import pandas as pd
from modules.services_factory import ServicesFactory
from modules.data_service import TagDataValue
from modules.model_service import ModelService
from datetime import datetime
import dateutil.parser

pd.options.plotting.backend = "plotly"

# specify tag list, start and end times here
taglist = ['c41f-ews-systemuptime']
starttime = '2021-06-10T14:00:00Z'
endtime = '2021-06-10T16:00:00Z'

# Get data and model services.
services = ServicesFactory('local.settings.production.json')
data_service = services.get_data_service()

tagvalues = []
for tag in taglist:
    for tagvalue in data_service.get_tag_data_values(tag, dateutil.parser.parse(starttime), dateutil.parser.parse(endtime)):
        tagvaluedict = tagvalue.__dict__
        tagvaluedict['tag_id'] = tag
        tagvalues.append(tagvaluedict)

df = pd.DataFrame(tagvalues)
df = df.pivot(index='t', columns='tag_id')

fig = df['v'].plot()
fig.update_traces(connectgaps=True)
fig.show()
I have this function below which makes my dataframe pretty with a border and some highlighting. However, because I have used .style I can't use .to_html() to put the dataframe in the body of an email, so I use .render(). However, when I use render() the border formatting changes slightly. There is a picture below of what it looks like in Python, which is how I want it, and another picture of how it looks in the email. Any idea how I can put the styled dataframe into the body of an email while keeping the formatting?
import win32com.client
import numpy as np
import pandas as pd
import datetime as dt
import time
import os

curr_date = dt.datetime.now().strftime("%Y%m%d")

csv = pd.read_csv("Pipeline_Signals_" + curr_date + ".csv", delimiter=',')
df = pd.DataFrame(csv)
df = df.replace(np.nan, '', regex=True)

def _color_red_or_green(val):
    color = 'red' if "*" in val else 'white'
    return 'background-color: %s' % color

df = (df.style
      .applymap(_color_red_or_green)
      .set_table_styles([{'selector': 'th', 'props': [('border-color', 'black'), ('background-color', 'white'), ('border-style', 'solid')]}])
      .hide_index()
      .set_properties(**{'color': 'black',
                         'border-style': 'solid',
                         'border-color': 'black'}))

df1 = df.render()

import win32com.client
inbox = win32com.client.gencache.EnsureDispatch("Outlook.Application").GetNamespace("MAPI")
inbox = win32com.client.Dispatch("Outlook.Application")
mail = inbox.CreateItem(0x0)
mail.To = "test@test.co.uk"
mail.CC = "test@test.co.uk"
mail.Subject = "Test Signals " + curr_date
mail.HTMLBody = df1
mail.Display()
This is what the dataframe looks like in Python and what I want it to look like.
This is what the dataframe looks like when I put it in the body of an email. For some reason, the borders change.
It seems that this comes from the CSS stylesheet the email client applies by default when your table is displayed. You should try setting the CSS "border-collapse" property of your table to "collapse" when you style it. If that doesn't work, try setting the "border-spacing" property to 0 as well; a sketch of how to do that is below.
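One way to apply those properties is through the Styler before rendering, via set_table_attributes, which adds inline attributes to the <table> tag itself. A minimal sketch, reusing the same applymap/hide_index/render calls as your code (so it assumes the same, older pandas version; the small DataFrame is just a stand-in for your data):

import pandas as pd

df = pd.DataFrame({'A': ['1*', '2'], 'B': ['3', '4*']})  # stand-in for your data

def _color_red_or_green(val):
    return 'background-color: %s' % ('red' if '*' in val else 'white')

styled = (df.style
          .applymap(_color_red_or_green)
          # Collapse borders and remove spacing directly on the <table> tag
          .set_table_attributes('style="border-collapse: collapse; border-spacing: 0"')
          .set_table_styles([{'selector': 'th',
                              'props': [('border-color', 'black'),
                                        ('background-color', 'white'),
                                        ('border-style', 'solid')]}])
          .hide_index()
          .set_properties(**{'color': 'black',
                             'border-style': 'solid',
                             'border-color': 'black'}))

html_body = styled.render()  # assign this to mail.HTMLBody as before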
I can't figure this one out. It's complaining about the wb.save() line, and I have no idea what's causing it after banging my head against this. I suspect it has something to do with opening a blank workbook and saving it after doing some formatting, but I can't see what I'm doing that causes the problem. It worked fine when I opened an existing spreadsheet and manipulated it, but that required me to have an existing spreadsheet in the first place. Here, I'm trying to start a new spreadsheet from scratch.
from bs4 import BeautifulSoup
from lxml import etree
import os, codecs
import imageFilesSub
import re
import openpyxl, lxml
from openpyxl.utils import get_column_letter, column_index_from_string

homeEnv = 0  # 1 - home, 0 - work
if homeEnv:
    filesDir = r'K:\Users\Johnny\My Documents\_World_of_Waterfalls\Website\tier 2 pages\tier 3 pages\tier 4 pages'
    filesOutDir = r'K:\Users\Johnny\My Documents\_World_of_Waterfalls\WordPressSite'
else:
    filesDir = r'..\old_travelblog_writeups'
    filesOutDir = r'./'

# First get the list of files to parse
filesInDir = os.listdir(filesDir)
filesToParse = []
for file in filesInDir:
    if ('travel-blog' in file) and (file.endswith('-template.html')):
        filesToParse.append(file)

# Open up the travelBlog spreadsheet and set it up
wb = openpyxl.Workbook()
sheet = wb.active
sheet.name = "travelBlog List"
sheet['A1'].value = 'Blog No.'
sheet['B1'].value = 'Title'
sheet['C1'].value = 'Category'
sheet['D1'].value = 'Keyword Tags'
sheet['E1'].value = 'Excerpt'
sheet['F1'].value = 'Featured Image Filename'
sheet['G1'].value = 'Featured Image Alt Text'
sheet['H1'].value = 'Start Date'
sheet['I1'].value = 'End Date'
sheet['J1'].value = 'Old Web Address'
sheet['K1'].value = 'New Travel Blog Body Filename'
sheet['L1'].value = 'Old Travel Blog Template to parse'
sheet.freeze_panes = 'C2'
sheet.column_dimensions['A'].width = 10
sheet.column_dimensions['H','I'] = 20
sheet.column_dimensions['B','F','J','K','L'] = 40
sheet.column_dimensions['D','E'] = 50

from openpyxl.styles import Font
headerFontObj = Font(name='Arial', bold=True)
for col in range(1, sheet.max_column):
    sheet.cell(row=1, column=col).font = headerFontObj

wb.save('travelBlogParsed.xlsx')
Thanks in advance,
Johnny
I figured it out. The issue was the following lines:
sheet.column_dimensions['H','I'] = 20
sheet.column_dimensions['B','F','J','K','L'] = 40
sheet.column_dimensions['D','E'] = 50
First of all, apparently you can't index column_dimensions with several column letters at once; the key must be a single column letter. Second, these lines were missing the .width attribute. So those lines should have been:
sheet.column_dimensions['H'].width = 20
sheet.column_dimensions['I'].width = 20
sheet.column_dimensions['B'].width = 40
...
sheet.column_dimensions['E'].width = 50
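If you want to set a shared width for several columns without repeating yourself, a small loop over the column letters works too. A minimal sketch of that idea (the letters and widths are taken from the code above):

import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active

# Apply a shared width to each group of columns in one pass
for letters, width in (('HI', 20), ('BFJKL', 40), ('DE', 50)):
    for letter in letters:
        sheet.column_dimensions[letter].width = width

wb.save('travelBlogParsed.xlsx')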