Convert a dataframe to a dictionary with custom structure - python

I have the following data frame:
             3        10          6          4            timestamp
462  75.768780  21.47490  20.725380  100.00000  2020-04-08 08:30:05
463  77.612500  21.47490  21.025310  100.00000  2020-04-08 08:30:35
464  77.612500  18.41914  21.025310  100.00000  2020-04-08 08:30:40
465  75.851290  18.41914  19.776660  100.00000  2020-04-08 08:31:05
466   2.908084  16.36317   4.460456    4.48286  2020-04-08 08:31:25
I want to convert this into a format like below if the df contains a timestamp column:
{datetime.datetime(2020, 4, 8, 8, 30, 5, tzinfo=<UTC>): {'3': 75.76878,
                                                         '10': 21.474900000000005,
                                                         '6': 20.72538,
                                                         '4': 100.0},
 datetime.datetime(2020, 4, 8, 8, 30, 35, tzinfo=<UTC>): {'3': 77.6125,
                                                          '10': 21.474900000000005,
                                                          '6': 21.02531,
                                                          '4': 100.0},
}
If the data frame is in the format without timestamp column:
            1       2    3
36026  7.2246  5.4106  0.0
36027  7.2154  5.4115  0.0
36028  7.1923  5.4111  0.0
36029  7.2028  5.4106  0.0
36030  7.2141  5.4123  0.0
I want to convert this data frame into a format like below:
{36026: {'1': 7.2246, '2': 5.4106, '3': 0.0},
36027: {'1': 7.2154, '2': 5.4115, '3': 0.0},
36028: {'1': 7.1923, '2': 5.4111, '3': 0.0},
36029: {'1': 7.2028, '2': 5.4106, '3': 0.0},
36030: {'1': 7.2141, '2': 5.4123, '3': 0.0}}
Currently I am achieving this with df.iterrows() in a for loop, converting each row to the corresponding format. Any efficient solution would be great.
Thanks in advance.
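A vectorized alternative to iterrows: DataFrame.to_dict(orient='index') returns exactly this {index: {column: value}} nesting, so it should be enough to make the timestamp column the index first whenever it is present. A minimal sketch, using a small stand-in frame:
import pandas as pd

# Small stand-in for the second (no-timestamp) frame from the question.
df = pd.DataFrame({'1': [7.2246, 7.2154], '2': [5.4106, 5.4115], '3': [0.0, 0.0]},
                  index=[36026, 36027])

if 'timestamp' in df.columns:
    result = df.set_index('timestamp').to_dict(orient='index')  # keys become the timestamps
else:
    result = df.to_dict(orient='index')                         # keys stay the row index

print(result)
# {36026: {'1': 7.2246, '2': 5.4106, '3': 0.0},
#  36027: {'1': 7.2154, '2': 5.4115, '3': 0.0}}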


How to convert dictionary with multiple keys-values pairs to DataFrame

I tried to clean the data with this code:
empty = {}
mess = lophoc_clean.query("lop_diemquatrinh.notnull()")[['lop_id', 'lop_diemquatrinh']]
keys = []
values = []
for index, rows in mess.iterrows():
    if len(rows['lop_diemquatrinh']) > 4:
        values.append(rows['lop_diemquatrinh'])
        keys.append(rows['lop_id'])
df = pd.DataFrame(dict(zip(keys, values)), index=[0]).transpose()
df.columns = ['data']
The result is a dictionary like this
{'data': {37: '[{"date_update":"31-03-2022","diemquatrinh":"6.0"}]',
38: '[{"date_update":"11-03-2022","diemquatrinh":"6.25"}]',
44: '[{"date_update":"25-12-2021","diemquatrinh":"6.0"},{"date_update":"28-04-2022","diemquatrinh":"6.25"},{"date_update":"28-07-2022","diemquatrinh":"6.5"}]',
1095: '[{"date_update":null,"diemquatrinh":null}]'}}
However, I don't know how to make them into a DataFrame with 3 columns like this. Please help me. Thank you!
id    updated_at  diemquatrinh
38    11-03-2022  6.25
44    25-12-2021  6.0
44    28-04-2022  6.25
44    28-07-2022  6.5
1095  null        null
Here you go.
from json import loads
from pprint import pp
import pandas as pd

def get_example_data():
    return [
        dict(id=38, updated_at="2022-03-11", diemquatrinh=6.25),
        dict(id=44, updated_at="2021-12-25", diemquatrinh=6),
        dict(id=44, updated_at="2022-04-28", diemquatrinh=6.25),
        dict(id=1095, updated_at=None),
    ]

df = pd.DataFrame(get_example_data())
df["updated_at"] = pd.to_datetime(df["updated_at"])

print(df.dtypes, "\n")
pp(loads(df.to_json()))
print()
print(df, "\n")
pp(loads(df.to_json(orient="records")))
It produces this output:
id                       int64
updated_at      datetime64[ns]
diemquatrinh           float64
dtype: object
{'id': {'0': 38, '1': 44, '2': 44, '3': 1095},
'updated_at': {'0': 1646956800000,
'1': 1640390400000,
'2': 1651104000000,
'3': None},
'diemquatrinh': {'0': 6.25, '1': 6.0, '2': 6.25, '3': None}}
     id updated_at  diemquatrinh
0    38 2022-03-11          6.25
1    44 2021-12-25          6.00
2    44 2022-04-28          6.25
3  1095        NaT           NaN
[{'id': 38, 'updated_at': 1646956800000, 'diemquatrinh': 6.25},
{'id': 44, 'updated_at': 1640390400000, 'diemquatrinh': 6.0},
{'id': 44, 'updated_at': 1651104000000, 'diemquatrinh': 6.25},
{'id': 1095, 'updated_at': None, 'diemquatrinh': None}]
Either of the JSON data structures would be acceptable input for creating a new DataFrame from scratch.
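Back to the asker's actual data: each value is a JSON string holding a list of records, so one way to get the three-column frame is to json.loads each string and flatten. A sketch, assuming every value is a JSON array of objects with those two keys:
import json
import pandas as pd

# The cleaned dict from the question: lop_id -> JSON string of score updates.
data = {38: '[{"date_update":"11-03-2022","diemquatrinh":"6.25"}]',
        44: '[{"date_update":"25-12-2021","diemquatrinh":"6.0"},'
            '{"date_update":"28-04-2022","diemquatrinh":"6.25"},'
            '{"date_update":"28-07-2022","diemquatrinh":"6.5"}]',
        1095: '[{"date_update":null,"diemquatrinh":null}]'}

# Each JSON string decodes to a list of dicts; flatten to one row per record.
rows = [{'id': k, 'updated_at': rec['date_update'], 'diemquatrinh': rec['diemquatrinh']}
        for k, raw in data.items()
        for rec in json.loads(raw)]

df = pd.DataFrame(rows)
print(df)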

Deleting rows from a dataframe with nested for loops

Some background: the MB column only contains one of two values (M or B), while the INDENT column contains integers. The numbers don't necessarily follow a pattern, but when they increment, they increment by one; they can decrement by any amount. The rows are sorted in a specific order.
The goal here is to drop the rows whose INDENT value is higher than that of a preceding row containing "B" in the MB column. Dropping should only stop once the INDENT value is equal to or less than that of the "B" row. Below is a chart demonstrating which rows should be dropped.
Sample data:
import pandas as pd
d = {'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3}, 'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'}}
df = pd.DataFrame(d)
Code:
My current code has issues where I cant drop the rows of the inner for loop since it isn't using iterrows. I am aware of dropping based on a conditional expression but I am unsure how to nest this correctly.
for index, row in df.iterrows():
    for row in range(index - 1, 0, -1):
        if df.loc[row].at["INDENT"] <= df.loc[index].at["INDENT"] - 1:
            if df.loc[row].at["MB"] == "B":
                df.drop(df.index[index], inplace=True)
                break
            else:
                break
Edit 1:
This problem can be represented graphically. This is effectively scanning a hierarchy for an attribute and deleting anything below it. The example I provided is bad since all rows that need to be dropped are simply indent 3 or higher but this can happen at any indent level.
Edit 2: We are going to cheat on this problem a bit. I won't have to generate an edge graph from scratch since I have the prerequisite data to do this. I have an updated table and sample data.
Updated Sample Data
import pandas as pd
d = {
    'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3},
    'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'},
    'a': {'0': -1, '1': 5000, '2': 5000, '3': 5322, '4': 5449, '5': 5449, '6': 5621, '7': 5322, '8': 4666},
    'c': {'0': 5000, '1': 5222, '2': 5322, '3': 5449, '4': 5923, '5': 5621, '6': 5109, '7': 4666, '8': 5219}
}
df = pd.DataFrame(d)
Updated Code
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
d = {
    'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3},
    'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'},
    'a': {'0': -1, '1': 5000, '2': 5000, '3': 5322, '4': 5449, '5': 5449, '6': 5621, '7': 5322, '8': 4666},
    'c': {'0': 5000, '1': 5222, '2': 5322, '3': 5449, '4': 5923, '5': 5621, '6': 5109, '7': 4666, '8': 5219}
}
df = pd.DataFrame(d)
G = nx.Graph()
G = nx.from_pandas_edgelist(df, 'a', 'c', create_using=nx.DiGraph())
T = nx.dfs_tree(G, source=-1).reverse()
print([x for x in T])
nx.draw(G, with_labels=True)
plt.show()
I am unsure how to use the edges from here to identify the rows that need to be dropped from the dataframe.
Not an answer, but too long for a comment:
import networkx as nx
import pandas as pd

d = {'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3},
     'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'}}
df = pd.DataFrame(d)

df['i'] = df['INDENT'] + 1
df = df.reset_index()
# Join each row to every row whose INDENT is one level deeper.
df = df.merge(
    df[['INDENT', 'index', 'MB']].rename(
        columns={'INDENT': 'target', 'index': 'ix', 'MB': 'MBt'}),
    left_on=['i'], right_on=['target'], how='left')

G = nx.from_pandas_edgelist(df, 'index', 'ix', create_using=nx.DiGraph())
T = nx.dfs_tree(G, source='0').reverse()
print([x for x in T])
nx.draw(G, with_labels=True)
This demonstrates the problem. You actually want to apply graph theory, and the library networkx can help you with that. A first step would be to construct the connections between the nodes, as I did in the example above. From there you can try to apply logic to filter out the edges you don't want.
Not sure I fully understand the question you're asking, but here is my attempt. It selects rows only where Index >= indent and MB == 'B'. In this case I'm selecting the subset we want instead of dropping the subset we don't.
import numpy as np
import pandas as pd

x = np.transpose(np.array([[0, 1, 2, 3, 4, 5, 6, 7, 8],
                           [0, 1, 1, 2, 3, 3, 4, 2, 3],
                           ['M', 'B', 'M', 'B', 'B', 'M', 'M', 'B', 'M']]))
df = pd.DataFrame(x, columns=['Index', 'indent', 'MB'])
df1 = df[(df['Index'] >= df['indent']) & (df['MB'] == 'B')]
print(df1)
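For completeness, the drop rule from the original question can also be done without graphs, in a single forward pass that remembers the INDENT of the nearest kept "B" row. A minimal sketch, assuming a "B" row's descendants are exactly the rows that follow it with strictly greater INDENT:
import pandas as pd

d = {'INDENT': {'0': 0, '1': 1, '2': 1, '3': 2, '4': 3, '5': 3, '6': 4, '7': 2, '8': 3},
     'MB': {'0': 'M', '1': 'B', '2': 'M', '3': 'B', '4': 'B', '5': 'M', '6': 'M', '7': 'B', '8': 'M'}}
df = pd.DataFrame(d)

keep = []
block_indent = None  # INDENT of the nearest kept "B" row still in effect
for _, row in df.iterrows():
    if block_indent is not None and row['INDENT'] > block_indent:
        keep.append(False)        # descendant of a "B" row: drop it
        continue
    block_indent = None           # climbed back out of the blocked subtree
    keep.append(True)
    if row['MB'] == 'B':
        block_indent = row['INDENT']

print(df[keep])                   # rows 4, 5, 6 and 8 are gone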

Plotly: How to fill table by rows using a pandas dataframe as source

I have this df: vertical_stack = pd.concat([eu_table, us_table], axis=0):
0 1 2 3 4 5 6 7 8 9 10 11
EU value 21 13 9 11 15 9 8 11 11 20 19 6
USA value 17 28 14 18 16 11 25 26 22 27 25 17
and I wanted to create a table in Plotly filled by rows, as above, but I can only fill the table vertically, where 'EU value' becomes a column and all its values are filled downwards. This is what I used:
fig3 = go.Figure(data=[go.Table(
    header=dict(values=['', 'Sept. 19', 'Oct. 19', 'Nov. 19', 'Dec. 19',
                        'Jan. 20', 'Feb. 20', 'Mar. 20', 'Apr. 20',
                        'May 20', 'Jun. 20', 'Jul. 20', 'Aug. 20']),
    cells=dict(values=[vertical_stack])
)])
And this is my output:
As you can see all the values are filled in the same cell.
I'm not sure what's causing your error here, but I can show you how to build a table out of a pandas dataframe like this:
0 1 2 3 4 5 6 7 8 9 10 11
EU value 21 13 9 11 15 9 8 11 11 20 19 6
USA value 17 28 14 18 16 11 25 26 22 27 25 17
And since you've tagged your question plotly-dash but only used pure plotly in your example, I might as well provide a suggestion for both approaches in two separate parts.
Part 1 - Plotly and graph_objects
Using df = pd.read_clipboard(sep='\\s+') and df.to_dict() will let you build a reproducible dataframe like so:
df = pd.DataFrame({'0': {'EU value': 21, 'USA value': 17},
                   '1': {'EU value': 13, 'USA value': 28},
                   '2': {'EU value': 9, 'USA value': 14},
                   '3': {'EU value': 11, 'USA value': 18},
                   '4': {'EU value': 15, 'USA value': 16},
                   '5': {'EU value': 9, 'USA value': 11},
                   '6': {'EU value': 8, 'USA value': 25},
                   '7': {'EU value': 11, 'USA value': 26},
                   '8': {'EU value': 11, 'USA value': 22},
                   '9': {'EU value': 20, 'USA value': 27},
                   '10': {'EU value': 19, 'USA value': 25},
                   '11': {'EU value': 6, 'USA value': 17}})
And this data sample is a bit more practical to work with.
It shouldn't matter if you're adding data to your table row by row or column by column. The result should be the same for your example. If the following suggestions do not work for your actual use case, then just let me know and we can talk about options.
The snippet below will produce the following figure using go.Figure() and go.Table().
Table 1
Complete code for plotly
import plotly.graph_objects as go
import pandas as pd

df = pd.DataFrame({'0': {'EU value': 21, 'USA value': 17},
                   '1': {'EU value': 13, 'USA value': 28},
                   '2': {'EU value': 9, 'USA value': 14},
                   '3': {'EU value': 11, 'USA value': 18},
                   '4': {'EU value': 15, 'USA value': 16},
                   '5': {'EU value': 9, 'USA value': 11},
                   '6': {'EU value': 8, 'USA value': 25},
                   '7': {'EU value': 11, 'USA value': 26},
                   '8': {'EU value': 11, 'USA value': 22},
                   '9': {'EU value': 20, 'USA value': 27},
                   '10': {'EU value': 19, 'USA value': 25},
                   '11': {'EU value': 6, 'USA value': 17}})

df.reset_index(inplace=True)
df.rename(columns={'index': ''}, inplace=True)

fig = go.Figure(data=[go.Table(
    header=dict(values=list(df.columns),
                fill_color='paleturquoise',
                align='left'),
    cells=dict(values=[df[col] for col in df.columns],
               fill_color='lavender',
               align='left'))
])
fig.show()
Part 2 - Plotly Dash using JupyterDash
The snippet below will produce the following table using, among other things:
dash_table.DataTable(
    id='table',
    columns=[{"name": i, "id": i} for i in df.columns],
    data=df.to_dict('records'),
)
Table 1
Complete code for JupyterDash
from jupyter_dash import JupyterDash
import dash_table
import pandas as pd

df = pd.DataFrame({'0': {'EU_value': 21, 'USA_value': 17},
                   '1': {'EU_value': 13, 'USA_value': 28},
                   '2': {'EU_value': 9, 'USA_value': 14},
                   '3': {'EU_value': 11, 'USA_value': 18},
                   '4': {'EU_value': 15, 'USA_value': 16},
                   '5': {'EU_value': 9, 'USA_value': 11},
                   '6': {'EU_value': 8, 'USA_value': 25},
                   '7': {'EU_value': 11, 'USA_value': 26},
                   '8': {'EU_value': 11, 'USA_value': 22},
                   '9': {'EU_value': 20, 'USA_value': 27},
                   '10': {'EU_value': 19, 'USA_value': 25},
                   '11': {'EU_value': 6, 'USA_value': 17}})

df.reset_index(inplace=True)
df.rename(columns={'index': ''}, inplace=True)

app = JupyterDash(__name__)
app.layout = dash_table.DataTable(
    id='table',
    columns=[{"name": i, "id": i} for i in df.columns],
    data=df.to_dict('records'),
)

app.run_server(mode='inline', port=8070, dev_tools_ui=True,
               dev_tools_hot_reload=True, threaded=True)
This worked for me:
import plotly.graph_objects as go
import pandas as pd

df = pd.read_csv(
    'https://raw.githubusercontent.com/plotly/datasets/master/solar.csv')

# Plotly's simple table: one inner list per column.
fig = go.Figure(data=[go.Table(
    header=dict(values=df.columns.tolist()),
    cells=dict(values=df.to_numpy().T.tolist())
)])
Reference: https://plotly.com/python/table/
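As for what caused the original symptom: go.Table fills cells.values column by column, one inner list per table column, so wrapping the whole DataFrame in a single list puts its entire repr into one cell. A sketch of fixing the asker's call directly, with a cut-down stand-in for vertical_stack:
import plotly.graph_objects as go
import pandas as pd

# Cut-down stand-in for the asker's vertical_stack frame.
vertical_stack = pd.DataFrame([[21, 13, 9], [17, 28, 14]],
                              index=['EU value', 'USA value'],
                              columns=['Sept. 19', 'Oct. 19', 'Nov. 19'])

fig = go.Figure(data=[go.Table(
    header=dict(values=[''] + vertical_stack.columns.tolist()),
    # One list per column: row labels first, then each month's values.
    cells=dict(values=[vertical_stack.index.tolist()]
                      + [vertical_stack[c].tolist() for c in vertical_stack.columns]))])
fig.show()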

How to return key for column with smallest values

I have this dictionary:
d = {'1': {'2': 1, '3': 0, '4': 0, '5': 1, '6': 29},
     '2': {'1': 13, '3': 1, '4': 0, '5': 21, '6': 0},
     '3': {'1': 0, '2': 0, '4': 1, '5': 0, '6': 1},
     '4': {'1': 1, '2': 17, '3': 1, '5': 2, '6': 0},
     '5': {'1': 39, '2': 1, '3': 0, '4': 0, '6': 14},
     '6': {'1': 0, '2': 0, '3': 43, '4': 1, '5': 0}}
I want to write a function that returns the column where all the values are <2 (less than 2).
So far I have turned the dictionary into a list and then tried a lot of things that didn't work... I know that the answer is column number 4.
This was my latest attempt:
def findFirstRead(overlaps):
    e = [[d[str(i)].get(str(j), '-') for j in range(1, 7)] for i in range(1, 7)]
    nested_list = e
    for i in map(itemgetter(x), nested_list):
        if i < 2:
            return x + 1
        else:
            continue
...and it was very wrong
The following set and list comprehensions list the columns whose maximum value is less than 2:
columns = {c for row in d.values() for c in row}
[c for c in columns if max(v.get(c, -1) for v in d.values()) < 2]
This returns ['4'].
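For the function the asker wanted, the same logic wraps up like so (a sketch; the name and the limit parameter are mine, and it runs against the d defined in the question):
def columns_below(d, limit=2):
    """Return the keys of columns whose values are all below `limit`."""
    columns = {c for row in d.values() for c in row}
    return [c for c in columns
            if max(v.get(c, -1) for v in d.values()) < limit]

print(columns_below(d))  # ['4']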

Making a dictionary of overlaps from a dictionary

This problem is teasing me:
I have 6 different sequences that overlap each other; they are named 1-6.
I have made a function that represents the sequences in a dictionary, and a function that gives me the part of the sequences that overlap.
Now I should use those 2 functions to construct a dictionary that gives the number of overlapping positions in both the right-to-left and the left-to-right order.
The dictionary I have made looks like:
{'1': 'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC',
'2': 'CTTTACCCGGAAGAGCGGGACGCTGCCCTGCGCGATTCCAGGCTCCCCACGGG',
'3': 'GTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGTCGTGAACACATCAGT',
'4': 'TGCGAGGGAAGTGAAGTATTTGACCCTTTACCCGGAAGAGCG',
'5': 'CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC',
'6': 'TGACAGTAGATCTCGTCCAGACCCCTAGCTGGTACGTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGT'}
I should end up with a result like:
{'1': {'3': 0, '2': 1, '5': 1, '4': 0, '6': 29},
'3': {'1': 0, '2': 0, '5': 0, '4': 1, '6': 1},
'2': {'1': 13, '3': 1, '5': 21, '4': 0, '6': 0},
'5': {'1': 39, '3': 0, '2': 1, '4': 0, '6': 14},
'4': {'1': 1, '3': 1, '2': 17, '5': 2, '6': 0},
'6': {'1': 0, '3': 43, '2': 0, '5': 0, '4': 1}}
It seems impossible.
I guess it's not, so if somebody could push me in the right direction (without doing it for me), it would be great.
This is a bit of a complicated one-liner, but it should work. Using find_overlaps() as the function that finds overlaps and seq_dict as the original dictionary of sequences:
overlaps = {seq: {other_seq: find_overlaps(seq_dict[seq], seq_dict[other_seq])
                  for other_seq in seq_dict if other_seq != seq} for seq in seq_dict}
Here it is with a bit nicer spacing:
overlaps = \
    {seq:
        {other_seq:
            find_overlaps(seq_dict[seq], seq_dict[other_seq])
         for other_seq in seq_dict if other_seq != seq}
     for seq in seq_dict}
The clean way:
dna = {
    '1': 'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC',
    '2': 'CTTTACCCGGAAGAGCGGGACGCTGCCCTGCGCGATTCCAGGCTCCCCACGGG',
    '3': 'GTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGTCGTGAACACATCAGT',
    '4': 'TGCGAGGGAAGTGAAGTATTTGACCCTTTACCCGGAAGAGCG',
    '5': 'CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC',
    '6': 'TGACAGTAGATCTCGTCCAGACCCCTAGCTGGTACGTCTTCAGTAGAAAATTG'
         'TTTTTTTCTTCCAAGAGGTCGGAGT'
}
def overlap(a, b):
    l = min(len(a), len(b))
    while True:
        if a[-l:] == b[:l] or l == 0:
            return l
        l -= 1

def all_overlaps(d):
    result = {}
    for k1, v1 in d.items():
        overlaps = {}
        for k2, v2 in d.items():
            if k1 == k2:
                continue
            overlaps[k2] = overlap(v1, v2)
        result[k1] = overlaps
    return result

print(all_overlaps(dna))
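A quick sanity check of the overlap helper (run after defining it above):
print(overlap('AACGT', 'CGTAA'))  # 3 -- 'CGT' ends the first string and starts the second
print(overlap('AAAA', 'TTTT'))    # 0 -- no suffix of the first matches a prefix of the second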
(By the way, you could've provided overlap yourself in the question to make it easier for everyone to answer.)
