I am struggling with python dictionaries. I created a dictionary that looks like:
d = {'0.500': ['18.4 0.5', '17.9 0.4', '16.9 0.4', '18.6 0.4'],
'1.000': ['14.8 0.5', '14.9 0.5', '15.6 0.4', '15.9 0.3'],
'0.000': ['23.2 0.5', '23.2 0.8', '23.2 0.7', '23.2 0.1']}
and I would like to end up having:
0.500 17.95 0.425
which is the key, average of (18.4+17.9+16.9+18.6), average of (0.5+0.4+0.4+0.4)
(and the same for 1.000 and 0.000 with their corresponding averages)
Initially my dictionary had only two values, so I could rely on indexes:
for key in d:
    dvdl1 = d[key][0].split(" ")[0]
    dvdl2 = d[key][1].split(" ")[0]
    average = (float(dvdl1) + float(dvdl2)) / 2
but now I would like my code to work for dictionaries whose values have different lengths, let's say 4 (example above) or 5 or 6 entries each...
Cheers!
for k, v in d.iteritems():
    col1, col2 = zip(*[map(float, x.split()) for x in v])
    print k, sum(col1)/len(v), sum(col2)/len(v)
...
0.500 17.95 0.425
1.000 15.3 0.425
0.000 23.2 0.525
How this works:
>>> v = ['18.4 0.5', '17.9 0.4', '16.9 0.4', '18.6 0.4']
First split each item at whitespace and apply float to each piece, so we get a list of lists:
>>> zipp = [map(float,x.split()) for x in v]
>>> zipp
[[18.4, 0.5], [17.9, 0.4], [16.9, 0.4], [18.6, 0.4]] #list of rows
Now we can use zip with *, which acts as un-zipping, and we get a list of columns.
>>> zip(*zipp)
[(18.4, 17.9, 16.9, 18.6), (0.5, 0.4, 0.4, 0.4)]
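The answer above uses Python 2 syntax (dict.iteritems and the print statement). As a minimal sketch, the same idea in Python 3 would look roughly like this:
# Python 3 sketch of the same approach: un-zip each value list into
# columns of floats, then average each column.
d = {'0.500': ['18.4 0.5', '17.9 0.4', '16.9 0.4', '18.6 0.4'],
     '1.000': ['14.8 0.5', '14.9 0.5', '15.6 0.4', '15.9 0.3'],
     '0.000': ['23.2 0.5', '23.2 0.8', '23.2 0.7', '23.2 0.1']}

for key, rows in d.items():
    col1, col2 = zip(*(map(float, row.split()) for row in rows))
    print(key, sum(col1) / len(col1), sum(col2) / len(col2))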
I'm trying to create a list of timestamps from a column in a dataframe that resets to zero after a certain number of rows. So, if the limit was 4, I want the count to add up the values of the column up to position 4, then reset to zero, and continue adding the values of the column from position 5, and so forth until it reaches the length of the column. I used itertools.islice earlier in the script to create a counter, so I was wondering if I could use a combination of this and itertools.count to do something similar? So far, this is my code:
cycle_time = list(itertools.islice(itertools.count(0,raw_data['Total Time (s)'][lens]),range(0, block_cycles),lens))
Where raw_data['Total Time (s)'] contains the values I wish to add up, block_cycles is the number I want to add up to in the dataframe column before resetting, and lens is the length of the column in the dataframe. Ideally, the output from my list would look like this:
print(cycle_time)
0
0.24
0.36
0.57
0
0.13
0.32
0.57
Which is calculated from this input:
print(raw_data['Total Time (s)'])
0
0.24
0.36
0.57
0.7
0.89
1.14
Which I would then append as a new column in a dataframe, interim_data_output['Cycle time (s)'], which details the time elapsed at that point in the 'cycle'. block_cycles is the number of iterations in each large 'cycle'. This is what I would do with the list:
interim_data_output['Cycle time (s)'] = cycle_time
I'm a bit lost here; is this even possible using these methods? I'd like to use itertools for performance reasons. Any help would be greatly appreciated!
Given the discussion in the comments, here is an example:
df = pd.DataFrame({'Total Time (s)':[0, 0.24, 0.36, 0.57, 0.7, 0.89, 1.14]})
Total Time (s)
0 0.00
1 0.24
2 0.36
3 0.57
4 0.70
5 0.89
6 1.14
You can do:
block_cycles = 4
# Calculate cycle times.
cycle_times = df['Total Time (s)'].diff().fillna(0).groupby(df.index // block_cycles).cumsum()
# Insert the desired zeros after all cycles.
for idx in range(block_cycles, cycle_times.index.max(), block_cycles):
    cycle_times.loc[idx-0.5] = 0
cycle_times = cycle_times.sort_index().reset_index(drop=True)
print(cycle_times)
Which gives:
0 0.00
1 0.24
2 0.36
3 0.57
4 0.00
5 0.13
6 0.32
7 0.57
Name: Total Time (s), dtype: float64
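The answer above is pure pandas. Since the question asks about itertools specifically, here is a rough sketch (not from the original answer) that produces the same sequence using islice and accumulate, assuming the timestamps are available as a plain list:
from itertools import accumulate, islice

def cycle_time_list(times, block_cycles):
    # Differences between consecutive timestamps; the first entry is 0.
    diffs = [0.0] + [b - a for a, b in zip(times, times[1:])]
    it = iter(diffs)
    out = []
    first = True
    while True:
        block = list(islice(it, block_cycles))
        if not block:
            break
        if not first:
            out.append(0.0)          # a zero marks the start of each new cycle
        out.extend(accumulate(block))
        first = False
    return out

print(cycle_time_list([0, 0.24, 0.36, 0.57, 0.7, 0.89, 1.14], 4))
# approximately [0.0, 0.24, 0.36, 0.57, 0.0, 0.13, 0.32, 0.57] (up to float rounding)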
How can I make a bar chart in matplotlib (or pandas) from the bins in my dataframe?
I want something like this, below, where the x-axis labels come from the low and high columns in my dataframe (so the first tick would read [-1.089, 0)) and the y value is the percent column in my dataframe.
Here is an example dataset. The dataset is already in this format (I don't have an uncut version).
df = pd.DataFrame(
{
"low": [-1.089, 0, 0.3, 0.5, 0.6, 0.8],
"high": [0, 0.3, 0.5, 0.6, 0.8, 10.089],
"percent": [0.509, 0.11, 0.074, 0.038, 0.069, 0.202],
}
)
display(df)
Create a new column using the low and high cols.
Convert the values in the low and high columns to str type and combine them into the [<low>, <high>) notation that you want.
From there, you can create a bar plot directly from df using df.plot.bar(), assigning the newly created column as x and percent as y.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html
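A rough sketch of those steps (the "label" column name is just an example):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {
        "low": [-1.089, 0, 0.3, 0.5, 0.6, 0.8],
        "high": [0, 0.3, 0.5, 0.6, 0.8, 10.089],
        "percent": [0.509, 0.11, 0.074, 0.038, 0.069, 0.202],
    }
)
# Build "[low, high)" labels as strings, then plot percent against them.
df["label"] = "[" + df["low"].astype(str) + ", " + df["high"].astype(str) + ")"
df.plot.bar(x="label", y="percent")
plt.show()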
Recreate the bins using IntervalArray.from_arrays:
df['label'] = pd.arrays.IntervalArray.from_arrays(df.low, df.high)
# low high percent label
# 0 -1.089 0.000 0.509 (-1.089, 0.0]
# 1 0.000 0.300 0.110 (0.0, 0.3]
# 2 0.300 0.500 0.074 (0.3, 0.5]
# 3 0.500 0.600 0.038 (0.5, 0.6]
# 4 0.600 0.800 0.069 (0.6, 0.8]
# 5 0.800 10.089 0.202 (0.8, 10.089]
Then plot with x as these bins:
df.plot.bar(x='label', y='percent')
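Note that Interval bins are closed on the right by default, which is why the labels above read (-1.089, 0.0]. If you want the [low, high) notation from the question, from_arrays accepts a closed argument:
df['label'] = pd.arrays.IntervalArray.from_arrays(df.low, df.high, closed='left')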
There are dozens of similar-sounding questions here; I think I've searched them all and could not find a solution to my problem:
I have 2 DataFrames. df_c:
CAN-01 CAN-02 CAN-03
CE
ce1 0.84 0.73 0.50
ce2 0.06 0.13 0.05
And df_z:
CAN-01 CAN-02 CAN-03
marker
cell1 0.29 1.5 7
cell2 1.00 3.0 1
I want to join for each 'marker' + 'CE' combination over their column names
Example: cell1 + ce1:
[[0.29, 0.84],[1.5,0.73],[7,0.5], ...]
(Continuing for cell1 + ce2, cell2 + ce1, cell2 + ce2)
I have a working example using two loops and .loc twice, but it takes forever on the full data set.
I think the best thing to build is a MultiIndex DataFrame with some merge/join/concat magic:
CAN-01 CAN-02 CAN-03
Source
0 CE 0.84 0.73 0.50
Marker 0.29 1.5 7
1 CE ...
Marker ...
Sample Code
dc = [['ce1', 0.84, 0.73, 0.5], ['ce2', 0.06, 0.13, 0.05]]
dat_c = pd.DataFrame(dc, columns=['CE', 'CAN-01', 'CAN-02', 'CAN-03'])
dat_c.set_index('CE',inplace=True)
dz = [['cell1', 0.29, 1.5, 7],['cell2', 1, 3, 1]]
dat_z = pd.DataFrame(dz, columns=['marker', "CAN-01", "CAN-02", "CAN-03"])
dat_z.set_index('marker',inplace=True)
Bad/Slow Solution
for ci, c_row in dat_c.iterrows():              # for each CE in the CE table
    for marker, z_row in dat_z.iterrows():      # for each marker in the marker table
        tmp = []
        for j, colz in enumerate(dat_z.columns):
            if colz not in dat_c:
                continue
            entry_c = c_row.loc[colz]
            if len(entry_c.shape) > 0:
                continue
            tmp.append([dat_z.loc[marker, colz], entry_c])
IIUC:
use append()+groupby():
dat_c.index=[f"cell{x+1}" for x in range(len(dat_c))]
df=dat_c.append(dat_z).groupby(level=0).agg(list)
output of df:
CAN-01 CAN-02 CAN-03
cell1 [0.84, 0.29] [0.73, 1.5] [0.5, 7.0]
cell2 [0.06, 1.0] [0.13, 3.0] [0.05, 1.0]
If you need a list:
dat_c.index=[f"cell{x+1}" for x in range(len(dat_c))]
lst=dat_c.append(dat_z).groupby(level=0).agg(list).to_numpy().tolist()
output of lst:
[[[0.84, 0.29], [0.73, 1.5], [0.5, 7.0]],
[[0.06, 1.0], [0.13, 3.0], [0.05, 1.0]]]
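Note that DataFrame.append was removed in pandas 2.0; the same result can be produced with pd.concat (an equivalent sketch of the code above):
dat_c.index = [f"cell{x+1}" for x in range(len(dat_c))]
df = pd.concat([dat_c, dat_z]).groupby(level=0).agg(list)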
I'd like to find an efficient way to use the df.groupby() function in pandas to return both the means and standard deviations of a data frame - preferably in one shot!
import pandas as pd
df = pd.DataFrame({'case': [1, 1, 2, 2, 3, 3],
                   'condition': [1, 2, 1, 2, 1, 2],
                   'var_a': [0.92, 0.88, 0.90, 0.79, 0.94, 0.85],
                   'var_b': [0.21, 0.15, 0.1, 0.16, 0.17, 0.23]})
With that data, I'd like an easier way (if there is one!) to perform the following:
grp_means = df.groupby('case', as_index=False).mean()
grp_sems = df.groupby('case', as_index=False).sem()
grp_means.rename(columns={'var_a':'var_a_mean', 'var_b':'var_b_mean'},
inplace=True)
grp_sems.rename(columns={'var_a':'var_a_SEM', 'var_b':'var_b_SEM'},
inplace=True)
grouped = pd.concat([grp_means, grp_sems[['var_a_SEM', 'var_b_SEM']]], axis=1)
grouped
Out[1]:
case condition var_a_mean var_b_mean var_a_SEM var_b_SEM
0 1 1.5 0.900 0.18 0.020 0.03
1 2 1.5 0.845 0.13 0.055 0.03
2 3 1.5 0.895 0.20 0.045 0.03
I also recently learned of the .agg() function, and tried df.groupby('grouper column') agg('var':'mean', 'var':sem') but this just returns a SyntaxError.
I think you need DataFrameGroupBy.agg, but then drop the ('condition', 'sem') column and use map to convert the MultiIndex into flat columns:
df = df.groupby('case').agg(['mean','sem']).drop(('condition','sem'), axis=1)
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
case condition_mean var_a_mean var_a_sem var_b_mean var_b_sem
0 1 1.5 0.900 0.020 0.18 0.03
1 2 1.5 0.845 0.055 0.13 0.03
2 3 1.5 0.895 0.045 0.20 0.03
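As an alternative not covered by the answer above, named aggregation (available since pandas 0.25) selects the statistic per column and names the output columns in one call, so no MultiIndex flattening is needed; a sketch:
import pandas as pd

df = pd.DataFrame({'case': [1, 1, 2, 2, 3, 3],
                   'condition': [1, 2, 1, 2, 1, 2],
                   'var_a': [0.92, 0.88, 0.90, 0.79, 0.94, 0.85],
                   'var_b': [0.21, 0.15, 0.1, 0.16, 0.17, 0.23]})

# Each keyword names an output column as (source column, statistic).
grouped = df.groupby('case', as_index=False).agg(
    condition_mean=('condition', 'mean'),
    var_a_mean=('var_a', 'mean'),
    var_a_sem=('var_a', 'sem'),
    var_b_mean=('var_b', 'mean'),
    var_b_sem=('var_b', 'sem'),
)
print(grouped)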
How can I create from this (two columns, fixed width):
0.35 23.8
0.39 23.7
0.43 23.6
0.47 23.4
0.49 23.1
0.51 22.8
0.53 22.4
0.55 21.6
Two lists:
list1 = [0.35, 0.39, 0.43, ...]
list2 = [23.8, 23.7, 23.6, ...]
Thank you.
You are probably looking for something like this:
>>> str1 = """0.35 23.8
0.39 23.7
0.43 23.6
0.47 23.4
0.49 23.1
0.51 22.8
0.53 22.4
0.55 21.6"""
>>> zip(*(e.split() for e in str1.splitlines()))
[('0.35', '0.39', '0.43', '0.47', '0.49', '0.51', '0.53', '0.55'), ('23.8', '23.7', '23.6', '23.4', '23.1', '22.8', '22.4', '21.6')]
You can easily extend the above solution to cater to any type of iterable, including a file:
>>> with open("test1.txt") as fin:
...     print zip(*(e.split() for e in fin))
[('0.35', '0.39', '0.43', '0.47', '0.49', '0.51', '0.53', '0.55'), ('23.8', '23.7', '23.6', '23.4', '23.1', '22.8', '22.4', '21.6')]
If you want the numbers as floats instead of strings, you need to pass them through the float function, possibly via map:
>>> zip(*(map(float, e.split()) for e in str1.splitlines()))
[(0.35, 0.39, 0.43, 0.47, 0.49, 0.51, 0.53, 0.55), (23.8, 23.7, 23.6, 23.4, 23.1, 22.8, 22.4, 21.6)]
And finally, to unpack it into two separate lists:
>>> from itertools import izip
>>> column_tuples = izip(*(map(float, e.split()) for e in str1.splitlines()))
>>> list1, list2 = map(list, column_tuples)
>>> list1
[0.35, 0.39, 0.43, 0.47, 0.49, 0.51, 0.53, 0.55]
>>> list2
[23.8, 23.7, 23.6, 23.4, 23.1, 22.8, 22.4, 21.6]
So how it works:
zip takes a list of iterables and returns a list of pairwise tuples, one element taken from each iterable. itertools.izip is similar, but instead of returning a list of pairwise tuples, it returns an iterator of pairwise tuples, which is more memory-friendly.
map applies a function to each element of an iterable, so map(float, e.split()) converts the strings to floats. Note that an alternate way of expressing a map is through a list comprehension or a generator expression.
Finally, str.splitlines converts the newline-separated string into a list of individual lines.
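In Python 3, zip already returns an iterator (itertools.izip no longer exists) and print is a function, so the final step would look roughly like this:
>>> column_tuples = zip(*(map(float, e.split()) for e in str1.splitlines()))
>>> list1, list2 = map(list, column_tuples)
>>> list1
[0.35, 0.39, 0.43, 0.47, 0.49, 0.51, 0.53, 0.55]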
Try this:
# `columns` is the whole two-column text as a single string
splitted = columns.split()
list1 = splitted[::2]   # column 1
list2 = splitted[1::2]  # column 2
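If floats rather than strings are wanted (an assumption about the goal), convert each slice:
list1 = [float(x) for x in splitted[::2]]   # column 1 as floats
list2 = [float(x) for x in splitted[1::2]]  # column 2 as floats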