Row comparison and append loop by columns


I maintain a master list of school data for monthly test scores. Every time a child takes a test and there is an update to 'Age', 'Score', or 'School', I want to insert a new row with the updated data so that I keep track of all the changes. I am trying to write a Python script to do this, but since I am a newbie, I keep running into issues.
I tried writing a loop but keep getting errors, including "False", "The truth value of a Series is ambiguous", and "tuple indices must be integers, not str".
import pandas as pd

master_df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D'],
                          'Age': [15, 14, 17, 13],
                          'School': ['AB', 'CD', 'EF', 'GH'],
                          'Score': [80, 75, 62, 100],
                          'Date': ['3/1/2019', '3/1/2019', '3/1/2019', '3/1/2019']})
updates_df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D'],
                           'Age': [16, 14, 17, 13],
                           'School': ['AB', 'ZX', 'EF', 'GH'],
                           'Score': [80, 90, 62, 100],
                           'Date': ['4/1/2019', '4/1/2019', '4/1/2019', '4/1/2019']})
# What I am trying to get is:
updated_master = pd.DataFrame({'ID': ['A', 'A', 'B', 'B', 'C', 'D'],
                               'Age': [15, 16, 14, 14, 17, 13],
                               'School': ['AB', 'AB', 'CD', 'ZX', 'EF', 'GH'],
                               'Score': [80, 80, 75, 90, 62, 100],
                               'Date': ['3/1/2019', '4/1/2019', '3/1/2019', '4/1/2019', '3/1/2019', '3/1/2019']})
temp_delta_list = []
m_score = master_df.iloc[1:, master_df.columns.get_loc('Score')]
m_age = master_df.iloc[1:, master_df.columns.get_loc('Age')]
m_school = master_df.iloc[1:, master_df.columns.get_loc('School')]
u_score = updates_df.iloc[1:, updates_df.columns.get_loc('Score')]
u_age = updates_df.iloc[1:, updates_df.columns.get_loc('Age')]
u_school = updates_df.iloc[1:, updates_df.columns.get_loc('School')]
for i in updates_df['ID'].values:
    updated_temp_score = updates_df[updates_df['ID'] == i], u_score
    updated_temp_age = updates_df[updates_df['ID'] == i], u_age
    updated_temp_school = updates_df[updates_df['ID'] == i], u_school
    master_temp_score = master_df[master_df['ID'] == i], m_score
    master_temp_age = master_df[master_df['ID'] == i], m_age
    master_temp_school = updates_df[master_df['ID'] == i], m_school
    if (updated_temp_score == master_temp_score) | (updated_temp_age == master_temp_age) | (updated_temp_school == master_temp_school):
        pass
    else:
        temp_deltas = updates_df[(updates_df['ID'] == i)]
        temp_delta_list.append(temp_deltas)
Ultimately, I want the loop to compare the row values for each ID, return the rows that differ in any column, and append them to master_df.
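One possible way to get the desired frame without an explicit loop is sketched below. It assumes that 'ID' is unique within each frame and that only 'Age', 'School', and 'Score' decide whether a row counts as changed; the names compare_cols, merged, changed, and new_rows are made up for this illustration.

```python
import pandas as pd

master_df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D'],
                          'Age': [15, 14, 17, 13],
                          'School': ['AB', 'CD', 'EF', 'GH'],
                          'Score': [80, 75, 62, 100],
                          'Date': ['3/1/2019'] * 4})
updates_df = pd.DataFrame({'ID': ['A', 'B', 'C', 'D'],
                           'Age': [16, 14, 17, 13],
                           'School': ['AB', 'ZX', 'EF', 'GH'],
                           'Score': [80, 90, 62, 100],
                           'Date': ['4/1/2019'] * 4})

# Assumption: these are the only columns that decide whether a row changed.
compare_cols = ['Age', 'School', 'Score']

# Line up each update with the master row that has the same ID.
merged = updates_df.merge(master_df, on='ID', how='left',
                          suffixes=('', '_master'))

# True where at least one compared column differs from the master value.
changed = pd.Series(False, index=merged.index)
for col in compare_cols:
    changed |= merged[col] != merged[col + '_master']

# Keep only the update columns of the changed rows.
new_rows = merged.loc[changed, list(updates_df.columns)]

# Master rows come first in the concat, so a stable sort on ID keeps
# the old row ahead of the new one for each ID.
updated_master = (pd.concat([master_df, new_rows], ignore_index=True)
                  .sort_values('ID', kind='stable')
                  .reset_index(drop=True))
print(updated_master)
```

An ID that appears only in updates_df has no master values to compare against, so it is treated as changed and appended as well.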

Related

How to manipulate data from binance stream

I am trying to manipulate the following data from a websocket.
Here is the data:
{'e': 'kline', 'E': 1659440374345, 's': 'MATICUSDT', 'k': {'t': 1659440100000, 'T': 1659440399999, 's': 'MATICUSDT', 'i': '5m', 'f': 274454614, 'L': 274455188, 'o': '0.87210000', 'c': '0.87240000', 'h': '0.87240000', 'l': '0.87000000', 'v': '145806.50000000', 'n': 575, 'x': False, 'q': '127036.96453000', 'V': '76167.60000000', 'Q': '66365.16664000', 'B': '0'}}
I am trying to extract the following: 'E', 's', and 'c', and rename them so that 'E' = time, 's' = symbol, and 'c' = price.
def createframe(msg):
    df = pd.DataFrame([msg])
    df = df.loc[:, ['s', 'E', 'c']]
    df.columns = ['symbol', 'Time', 'Price']
    df.Price = df.Price.astype(float)
    df.Time = pd.to_datetime(df.Time, unit='ms')
    return df
When I run the next line of code to pull data:
async with stream as receiver:
    while True:
        data = await receiver.recv()
        data = json.loads(data)['data']
        df = createframe(data)
        print(df)
I am getting an error that 'c' is not defined.
PLEASE HELP. THANK YOU

If you look at the data frame, you'll see that in column "k" you have a whole dictionary's worth of data. That's because the value of k is itself a dictionary. You're getting the error that c is not defined because it is not a column itself, just a piece of data in column "k".
In order to get all this data into individual columns, you'll have to "flatten" the data. You can do something like this:
def createframe(msg):
    df = pd.DataFrame([msg])
    df = df.loc[:, ['s', 'E', 'c']]
    df.columns = ['symbol', 'Time', 'Price']
    df.Price = df.Price.astype(float)
    df.Time = pd.to_datetime(df.Time, unit='ms')
    return df

def flatten(data):
    # Copy top-level values as they are and lift the keys of any nested
    # dict (here, the "k" entry) up to the top level.
    newdict = {}
    for each in data:
        if isinstance(data[each], dict):
            for i in data[each]:
                newdict[i] = data[each][i]
        else:
            newdict[each] = data[each]
    return newdict

async with stream as receiver:
    while True:
        data = await receiver.recv()
        data = json.loads(data)['data']
        data = flatten(data)
        df = createframe(data)
        print(df)
Hope this helps! If you have questions just comment on this answer.
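If only those three fields are needed, another option is to read them straight from the message instead of flattening everything. This is a minimal sketch that assumes the payload always has the shape shown in the question; createframe_direct is a made-up name, and sample is a trimmed copy of the question's message.

```python
import pandas as pd

def createframe_direct(msg):
    """Build a one-row frame from a kline message (hypothetical helper)."""
    row = {
        'symbol': msg['s'],
        'Time': pd.to_datetime(msg['E'], unit='ms'),
        # The close price sits inside the nested 'k' dictionary.
        'Price': float(msg['k']['c']),
    }
    return pd.DataFrame([row])

# Trimmed copy of the message from the question.
sample = {'e': 'kline', 'E': 1659440374345, 's': 'MATICUSDT',
          'k': {'s': 'MATICUSDT', 'i': '5m', 'c': '0.87240000', 'x': False}}

df = createframe_direct(sample)
print(df)
```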

how is the output of this nested loop being calculated?

Hi, I have this calculation, but I don't understand how the line [array([1050885., 1068309., 1085733., 1103157., 1120581.]) of the output is calculated. Please explain.
#creating sample data:
import numpy as np
import pandas as pd

data1 = pd.DataFrame({"client": ['x1', 'x2'],
                      "cat": ['Bb', 'Ee'],
                      "amt": [1000,300],
                      "time":[2, 3],
                      "group":[10, 25]})
listc = ['Aa','Bb','Cc','Dd','Ee']
val1 = pd.DataFrame({'time': [1, 2, 3],
                     'lim %': [0.1, 0.11, 0.112]})
val2 = pd.concat([pd.DataFrame({'group':g, 'perc': 0.99, 'time':range(1, 11)}
                  for g in data1['group'].unique())]).explode('time')
mat = np.arange(75).reshape(3,5,5)
vals = [val1, val2]
data1['cat'] = pd.Categorical(data1['cat'],
                              categories=listc,
                              ordered=True).codes
for i in range(len(vals)):
    if 'group' in vals[i].columns:
        vals[i] = vals[i].set_index(['time', 'group'])
    else:
        vals[i] = vals[i].set_index(['time'])
#nested loop calculation
calc = {}
for client, cat, amt, start, group in data1.itertuples(name=None, index=False):
    for time in range(start, len(mat)+1):
        if time == start:
            calc[client] = [[amt * mat[time-1, cat, :]]]
        else:
            # matrix product of the previous result with mat at this time step
            calc[client].append([calc[client][-1][-1] @ mat[time-1]])
        for valcal in vals:
            if isinstance(valcal.index, pd.MultiIndex):
                value = valcal.loc[(time, group)].iat[0]
            else:
                value = valcal.loc[time].iat[0]
            calc[client][-1].append(value * calc[client][-1][-1])
output:
{'x1': [[array([30000, 31000, 32000, 33000, 34000]),
array([3300., 3410., 3520., 3630., 3740.]),
array([3267. , 3375.9, 3484.8, 3593.7, 3702.6])],
[array([1050885., 1068309., 1085733., 1103157., 1120581.]), #how is this line calculated?
array([117699.12 , 119650.608, 121602.096, 123553.584, 125505.072]),
array([116522.1288 , 118454.10192, 120386.07504, 122318.04816,
124250.02128])]],
'x2': [[array([21000, 21300, 21600, 21900, 22200]),
array([2352. , 2385.6, 2419.2, 2452.8, 2486.4]),
array([2328.48 , 2361.744, 2395.008, 2428.272, 2461.536])]]}
What I need the calculation for this line to be is:
[array([1050885., 1068309., 1085733., 1103157., 1120581.])
It should take array([3267. , 3375.9, 3484.8, 3593.7, 3702.6])] multiplied by mat at time 3. How can I get it to do this?
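For what it is worth, the line in question can be reproduced by hand, assuming the loop applies the matrix product (the @ operator) between the previous result and mat[time-1]. The sketch below takes the last array of the first block for x1 and multiplies it by mat at time 3, which is mat[2]; the names prev, step, and first are made up for the illustration.

```python
import numpy as np

mat = np.arange(75).reshape(3, 5, 5)

# Last array of the first block for client x1 (time 2).
prev = np.array([3267.0, 3375.9, 3484.8, 3593.7, 3702.6])

# Vector-matrix product: element j is sum over i of prev[i] * mat[2, i, j].
step = prev @ mat[2]
print(step)

# The same sum written out for the first element only.
first = sum(prev[i] * mat[2, i, 0] for i in range(5))
print(first)
```

Up to floating-point rounding this gives 1050885, 1068309, 1085733, 1103157, 1120581, so under that assumption the loop already multiplies the 3267... array by mat at time 3. An element-wise product (prev * mat[2]) would instead return a 5x5 array.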

Output to screen and csv format python

I have a Python nested dictionary as output. I have been able to remove the first set of curly brackets using RocketDict, but 1) I can't remove the second set of curly brackets, and 2) I tried to export it to a CSV file with column names, and that doesn't work because I can't figure out how to get the int#/# values, which increment down the rows, into the file. For example, here was my initial output:
Before RocketDict:
{ intx/x : {'value1': 'A', 'value2': 'B', 'value3': 'C'},
  inty/y : {'value1': 'X', 'value2': 'Y', 'value3': 'Z'}}
After the RocketDict:
intx/x : {'value1': 'A', 'value2': 'B', 'value3': 'C'},
inty/y : {'value1': 'X', 'value2': 'Y', 'value3': 'Z'}
Desired output:
intx/x : 'value1': 'A', 'value2': 'B', 'value3': 'C',
inty/y : 'value1': 'X', 'value2': 'Y', 'value3': 'Z'
Desired output to the csv:
Here is the full script:
import csv
from collections import UserDict

import requests

results = requests.get(url, headers=headers)
inventory = results.json()
data = inventory['config']

class RocketDict(UserDict):
    def __str__(self):
        r = ['']
        r.extend(['\t{} : {}'.format(k, v)
                  for k, v in self.items()])
        return ',\n'.join(r)

if __name__ == '__main__':
    #standard dict object
    # inventory = {('key-%02d' % v): v for v in range(1, 10)}
    # print(inventory, '\n')
    # Wrap that dict object into a RocketDict.
    d2 = RocketDict(data)
    print(d2)
    csv_columns = ['value1','value2','value3']
    dict_data = d2
    csv_file = 'mycsv.csv'
    try:
        with open(csv_file, 'w') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
            writer.writeheader()
            for data in dict_data:
                writer.writerow(d2)
    except IOError:
        print("I/O error")

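Use pandas:
import pandas as pd

# Stand-in for the nested dictionary from the question
data = {'intx/x': {'value1': 'A', 'value2': 'B', 'value3': 'C'},
        'inty/y': {'value1': 'X', 'value2': 'Y', 'value3': 'Z'}}
# transpose() turns each outer key into a row; index=True writes that key as the first CSV column
pd.DataFrame(data).transpose().to_csv('out.csv', index=True)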
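If pandas is not available, the standard csv module can do the same job. The sketch below assumes the nested dictionary looks like the sample in the question; the column name 'interface' is made up for the outer key, and the CSV is written to an in-memory buffer so the result can be inspected.

```python
import csv
import io

data = {'intx/x': {'value1': 'A', 'value2': 'B', 'value3': 'C'},
        'inty/y': {'value1': 'X', 'value2': 'Y', 'value3': 'Z'}}

# Screen output: one line per outer key, without the inner braces.
lines = []
for name, values in data.items():
    fields = ', '.join(f"{k}: {v!r}" for k, v in values.items())
    lines.append(f"{name} : {fields}")
print('\n'.join(lines))

# CSV output: the outer key gets its own column ('interface' is a made-up name).
csv_columns = ['interface', 'value1', 'value2', 'value3']
buffer = io.StringIO()  # swap for open('mycsv.csv', 'w', newline='') to write a file
writer = csv.DictWriter(buffer, fieldnames=csv_columns)
writer.writeheader()
for name, values in data.items():
    # Each row is the inner dict plus the outer key.
    writer.writerow({'interface': name, **values})
csv_text = buffer.getvalue()
print(csv_text)
```

Passing one inner dictionary per writerow call, together with its outer key, is what puts the int#/# values into the rows.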

reportlab dynamic data-driven header outputs wrong subtitle

I have created some fictitious, though representative, clinical trial type data using Pandas, and now come to some test reporting in ReportLab.
The data has a block (~50 rows) where the treatment column is 'Placebo' and the same amount where the treatment is 'Active'. I simply want to list the data using a sub-heading of 'Treatment Group: Placebo' for the first set and 'Treatment Group: Active' for the second.
There are some hits on a similar topic, and indeed I've used one of the suggested techniques, namely extending the arguments of a header function using partial from functools.
title1 = "ACME Corp CONFIDENTIAL"
title2 = "XYZ123 / Anti-Hypertensive Draft"
title3 = "Protocol XYZ123"
title4 = "Study XYZ123"
title5 = "Listing of Demographic Data by Treatment Arm"
title6 = "All subjects"
def title(canvas, doc, bytext):
    canvas.saveState()
    canvas.setFont(styleN.fontName, styleN.fontSize)
    canvas.drawString(DOCMARGIN, PAGE_HEIGHT*.975, title1)
    canvas.drawString(DOCMARGIN, PAGE_HEIGHT*.950, title2)
    canvas.drawString(DOCMARGIN, PAGE_HEIGHT*.925, title3)
    canvas.drawCentredString(PAGE_WIDTH/2.0, PAGE_HEIGHT*.900, title4)
    canvas.drawCentredString(PAGE_WIDTH/2.0, PAGE_HEIGHT*.875, title5)
    canvas.drawCentredString(PAGE_WIDTH/2.0, PAGE_HEIGHT*.850, title6)
    canvas.drawString(DOCMARGIN, PAGE_HEIGHT*.825, "Treatment Group:" + bytext)
    canvas.restoreState()
This is then called as follows. n_groups has the value 2 from a summary query; 0 maps to 'Placebo' and 1 maps to 'Active'.
def build_pdf(doc):
    ptemplates = []
    for armcd in range(n_groups):
        ptemplates.append(PageTemplate(id = 'PT' + str(armcd), frames = [dataFrame,],
                                       onPage = partial(title, bytext=t_dict[armcd]),
                                       onPageEnd = foot))
    doc.addPageTemplates(ptemplates)
    elements = []
    for armcd in range(n_groups):
        elements.append(NextPageTemplate('PT' + str(armcd)))
        sublist = [t for t in lista if t[0] == (armcd+1)]
        sublist.insert(0,colheads)
        data_table = Table(sublist, 6*[40*mm], len(sublist)*[DATA_CELL_HEIGHT], repeatRows=1)
        data_table.setStyle(styleC)
        elements.append(data_table)
        elements.append(PageBreak())
    doc.build(elements)
The report produces 6 pages. The first 3 pages of placebo data are correct, pages 5 & 6 of active data are correct, but page 4 - which should be the first page of the second 'active' group has the sub-title 'Treatment Group: Placebo'.
I have re-organized the order of the statements multiple times, but can't get Page 4 to sub-title correctly. Any help, suggestions or magic would be much appreciated.
[Edit 1: sample data structure]
The 'top' of the data starts as:
[
[1, 'Placebo', '000001-000015', '1976-09-20', 33, 'F', 'Black'],
[1, 'Placebo', '000001-000030', '1959-04-26', 50, 'M', 'Asian'],
[1, 'Placebo', '000001-000031', '1946-02-07', 64, 'F', 'Asian'],
[1, 'Placebo', '000001-000046', '1947-11-08', 62, 'M', 'Asian'],
etc for 50 rows, then continues with
[2, 'Active', '000001-000002', '1962-02-28', 48, 'F', 'Black'],
[2, 'Active', '000001-000008', '1975-10-20', 34, 'M', 'Black'],
[2, 'Active', '000001-000013', '1959-01-19', 51, 'M', 'White'],
[2, 'Active', '000001-000022', '1962-01-12', 48, 'F', 'Black'],
[2, 'Active', '000001-000036', '1976-10-17', 33, 'F', 'Asian'],
[2, 'Active', '000001-000045', '1980-12-31', 29, 'F', 'White'],
for another 50.
The column header inserted is:
['Treatment Arm Code',
'Treatment Arm',
'Site ID - Subject ID',
'Date of Birth',
'Age (Years)',
'Gender',
'Ethnicity'],
[Edit 2: A solution - move the PageBreak() and make it conditional, so that each NextPageTemplate is queued before the page break that starts its group:]
def build_pdf(doc):
    ptemplates = []
    for armcd in range(n_groups):
        ptemplates.append(PageTemplate(id = 'PT' + str(armcd), frames = [dataFrame,],
                                       onPage = partial(title, bytext=t_dict[armcd]),
                                       onPageEnd = foot))
    doc.addPageTemplates(ptemplates)
    elements = []
    for armcd in range(n_groups):
        elements.append(NextPageTemplate('PT' + str(armcd)))
        if armcd > 0:
            elements.append(PageBreak())
        sublist = [t for t in lista if t[0] == (armcd+1)]
        sublist.insert(0,colheads)
        data_table = Table(sublist, 6*[40*mm], len(sublist)*[DATA_CELL_HEIGHT], repeatRows=1)
        data_table.setStyle(styleC)
        elements.append(data_table)
    doc.build(elements)

Adding list in form of tuples to a dictionary

Assume there is a list with sublists like this:
[[2013, 'Patric', 'M', 1356], [2013, 'Helena', 'F', 202], [2013, 'Patric', 'F', 6],[1993, 'Patric', 'F', 7]......]
which is the output of def list_of_names(), where 2013 is the year, M is the gender, 1356 is the number of M births, etc.
I want to create a dictionary with the name as the key and a list of tuples (year, number_of_males, number_of_females) as the value. For example:
{ .. 'Patric':[... , (1993, 0, 7), (2013, 1356, 6), ... ], ... }.
Here 1993 is the year, 0 is the number of males, and 7 is the number of females; the tuples should be ordered by year.
I'm stuck on how to add this info to a dictionary:
def name_Index(names):
    d = dict()
    L = readNames() #the list with from previous def which outputs different names and info as above
    newlist = []
    for sublist in L:

from collections import defaultdict

def list_of_names():
    return [[2013, 'Patric', 'M', 1356],
            [2013, 'Helena', 'F', 202],
            [2013, 'Patric', 'F', 6],
            [1993, 'Patric', 'F', 7]]

def name_Index():
    # name -> year -> [male_count, female_count]
    tmp = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for year, name, sex, N in list_of_names():
        i = 0 if sex == 'M' else 1
        tmp[name][year][i] += N
    d = {}
    for name, entries in tmp.items():
        # sort by year so the tuples come out in chronological order
        d[name] = [(year, M, F) for (year, (M, F)) in sorted(entries.items())]
    return d

print(name_Index())

This was my attempt at the problem:
from collections import defaultdict, namedtuple
from itertools import groupby

data = [[2013, 'Patric', 'M', 1356],
        [2013, 'Helena', 'F', 202],
        [2013, 'Patric', 'F', 6],
        [1993, 'Patric', 'F', 7]]
names = defaultdict(list)
datum = namedtuple('datum', 'year gender number')
# Collect the rows per name
for k, g in groupby(data, key=lambda x: x[1]):
    for l in g:
        year, name, gender, number = l
        names[k].append(datum(year, gender, number))
final_dict = defaultdict(list)
for n in names:
    # groupby only merges adjacent items, so sort by year first
    by_year = sorted(names[n], key=lambda x: x.year)
    for k, g in groupby(by_year, lambda x: x.year):
        males = 0
        females = 0
        for l in g:
            if l.gender == 'M':
                males += l.number
            elif l.gender == 'F':
                females += l.number
        final_dict[n].append((k, males, females))
print(final_dict)

The most convenient way is to use collections.defaultdict. It returns a dictionary-like object that supplies a default value when a key is missing. In your case, the default is a nested defaultdict that maps each year to per-sex counts; the loop adds up the births, and the tuples are built at the end:
from collections import defaultdict

names = [[2013, 'Patric', 'M', 1356],
         [2013, 'Helena', 'F', 202],
         [2013, 'Patric', 'F', 6],
         [1993, 'Patric', 'F', 7]]

def name_Index(data):
    # name => year => sex
    d = defaultdict(lambda: defaultdict(lambda: {'F': 0, 'M': 0}))
    for year, name, sex, births in data:
        d[name][year][sex] += births
    # if you are fine with a defaultdict result: return d
    # else collect results into tuples, ordered by year:
    result = {}
    for name, years in d.items():
        result[name] = [(year, c['M'], c['F']) for year, c in sorted(years.items())]
    return result

print(name_Index(names))
# {'Patric': [(1993, 0, 7), (2013, 1356, 6)], 'Helena': [(2013, 0, 202)]}

I didn't understand why you take names as an argument of the name_Index function and then call readNames; presumably your work requires it. Hence, I just added a dummy readNames function and passed None as the argument to name_Index. Using classes is a good technique for handling complicated data structures. By the way, nicely written question, I must admit.
def readNames():
    return [[2013, 'Patric', 'M', 1356], [2013, 'Helena', 'F', 202], [2013, 'Patric', 'F', 6], [1993, 'Patric', 'F', 7]]

class YearOb(object):
    def __init__(self):
        self.male = 0
        self.female = 0

    def add_birth_data(self, gender, birth_count):
        if gender == "M":
            self.male += birth_count
        else:
            self.female += birth_count

class NameOb(object):
    def __init__(self):
        self.yearobs = dict()

    def add_record(self, year, gender, birth_count):
        if year not in self.yearobs:
            self.yearobs[year] = YearOb()
        self.yearobs[year].add_birth_data(gender, birth_count)

    def get_as_list(self):
        list_data = []
        # iterate in year order so the tuples come out chronologically
        for year, yearob in sorted(self.yearobs.items(), key=lambda item: item[0]):
            list_data.append((year, yearob.male, yearob.female))
        return list_data

def name_Index(names):
    d = dict()
    L = readNames()  # the list from the previous def, which outputs the names and info as above
    for sublist in L:
        name = sublist[1]
        if name not in d:
            d[name] = NameOb()
        d[name].add_record(sublist[0], sublist[2], sublist[3])
    # replace each NameOb with its list of (year, males, females) tuples
    for name, nameob in d.items():
        d[name] = nameob.get_as_list()
    return d

print(name_Index(None))
