Create column from multiple dataframes - python

I need to create some new columns based on the value of a dataframe field and a lookup dataframe with some rates.
Having df1 as
zone hh hhind
0 14 112.0 3.4
1 15 5.0 4.4
2 16 0.0 1.0
and a look_up df as
ind per1 per2 per3 per4
0 1.0 1.000 0.000 0.000 0.000
24 3.4 0.145 0.233 0.165 0.457
34 4.4 0.060 0.114 0.075 0.751
how can I create df1.hh1 by multiplying df1.hh by look_up.per1, matching df1.hhind against look_up.ind, to get:
zone hh hhind hh1
0 14 112.0 3.4 16.240
1 15 5.0 4.4 0.300
2 16 0.0 1.0 0.000
At the moment I am getting the result by merging the tables and then doing the arithmetic:
r = pd.merge(df1, look_up, left_on="hhind", right_on="ind")
r["hh1"] = r.hh * r.per1
I'd like to know if there is a more straightforward way to accomplish this without merging the tables.

You could first set hhind and ind as the index of df1 and look_up respectively, then multiply hh and per1 element-wise; pandas aligns the two Series on their shared index values.
Map the result back via the hhind column and assign it to a new column, as shown:
mapper = df1.set_index('hhind')['hh'].mul(look_up.set_index('ind')['per1'])
df1.assign(hh1=df1['hhind'].map(mapper))
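Note that assign returns a new DataFrame rather than modifying df1 in place, so assign the result back (df1 = df1.assign(hh1=...)) or write to df1['hh1'] directly if you want to keep the column.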

Another solution (note that the boolean filter returns a one-element Series, so .iloc[0] is needed to extract the scalar):
df1['hh1'] = df1['hhind'].map(lambda x: look_up.loc[look_up["ind"] == x, "per1"].iloc[0]) * df1['hh']
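To answer the "without merging" part directly, a minimal one-line sketch: turn look_up into a Series keyed on ind and let map do the lookup, then multiply by hh.
df1['hh1'] = df1['hh'] * df1['hhind'].map(look_up.set_index('ind')['per1'])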

Related

How to introduce missing values in time series data

I'm new to Python and also new to this site. My colleague and I are working on a time series dataset. We wish to introduce some missing values into the dataset and then use some techniques to fill them in, to see how well those techniques perform at the data imputation task. The challenge at the moment is how to introduce missing values in a consecutive manner, not just randomly: for example, we want to replace the data for a period of time, e.g. 3 consecutive days, with NaNs. I would really appreciate it if anyone could point us in the right direction. We are working with Python.
Here is my sample data
There is a method for filling NaNs:
dataframe['name_of_column'].fillna('value')
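For time series, interpolation may be a better fit than a constant fill, for example (this assumes the frame has a DatetimeIndex):
dataframe['name_of_column'].interpolate(method='time')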
See the set_missing_data function below:
import numpy as np

np.set_printoptions(precision=3, linewidth=1000)

def set_missing_data(data, missing_locations, missing_length):
    # overwrite a run of missing_length consecutive points at each location
    for i in missing_locations:
        data[i:i+missing_length] = np.nan

np.random.seed(0)
n_data_points = np.random.randint(40, 50)
data = np.random.normal(size=[n_data_points])

# pick how many gaps to introduce and where each gap starts
n_missing = np.random.randint(3, 6)
missing_length = 3
missing_locations = np.random.choice(
    n_data_points - missing_length,
    size=n_missing,
    replace=False
)

print(data)
set_missing_data(data, missing_locations, missing_length)
print(data)
Console output:
[ 0.118 0.114 0.37 1.041 -1.517 -0.866 -0.055 -0.107 1.365 -0.098 -2.426 -0.453 -0.471 0.973 -1.278 1.437 -0.078 1.09 0.097 1.419 1.168 0.947 1.085 2.382 -0.406 0.266 -1.356 -0.114 -0.844 0.706 -0.399 -0.827 -0.416 -0.525 0.813 -0.229 2.162 -0.957 0.067 0.206 -0.457 -1.06 0.615 1.43 -0.212]
[ 0.118 nan nan nan -1.517 -0.866 -0.055 -0.107 nan nan nan -0.453 -0.471 0.973 -1.278 1.437 -0.078 1.09 0.097 nan nan nan 1.085 2.382 -0.406 0.266 -1.356 -0.114 -0.844 0.706 -0.399 -0.827 -0.416 -0.525 0.813 -0.229 2.162 -0.957 0.067 0.206 -0.457 -1.06 0.615 1.43 -0.212]
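Since the data is a time series, the same idea works directly on a pandas Series with a DatetimeIndex. A minimal sketch (the dates and values here are made up for illustration):
import numpy as np
import pandas as pd

# illustrative series: 30 daily observations
idx = pd.date_range("2020-01-01", periods=30, freq="D")
ts = pd.Series(np.random.normal(size=len(idx)), index=idx)

# blank out 3 consecutive days starting at a chosen date
ts.loc["2020-01-10":"2020-01-12"] = np.nan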

Select value from dataframe based on other dataframe

I am trying to calculate the position of an object based on a timestamp. For this I have two dataframes in pandas: one for the measurement data and one for the position. All the movement is straightforward acceleration.
Dataframe 1 contains the measurement data:
ms force ... ... ...
1 5 20
2 10 20
3 15 25
4 20 30
5 25 20
..... (~ 6000 lines)
Dataframe 2 contains "positioning data"
ms speed (m/s)
1 0 0.66
2 4500 0.66
3 8000 1.3
4 16000 3.0
5 20000 3.0
.....(~300 lines)
Now I want to calculate the position for each row of the first dataframe using the data from the second dataframe.
In Excel I solved the problem with an array formula, but now I have to use Python/pandas and I can't find a way to select the correct row from dataframe 2.
My idea is to use something like an if-condition (see the pseudo code in the update below).
In the end I want to display a graph of "force vs. distance" rather than "force vs. time".
Thank you in advance.
==========================================================================
Update:
In the meantime I have almost solved my issue. Now my data looks like this:
Dataframe 2 (Speed Data):
pos v a t t-end t-start
0 -3.000 0.666667 0.000000 4.500000 4.500000 0.000000
1 0.000 0.666667 0.187037 0.071287 4.571287 4.500000
2 0.048 0.680000 0.650794 0.010244 4.581531 4.571287
3 0.055 0.686667 0.205432 0.064904 4.646435 4.581531
...
15 0.055 0.686667 0.5 0.064904 23.0 20.0
...
28 0.055 0.686667 0.6 0.064904 35.0 34.0
...
30 0.055 0.686667 0.9 0.064904 44.0 39.0
And Dataframe 1 (time based measurement):
Fx Fy Fz abs_t expected output ('a' from DF2)
0 -13.9 170.3 45.0 0.005 0.000000
1 -14.1 151.6 38.2 0.010 0.000000
...
200 -14.1 131.4 30.4 20.015 0.5
...
300 -14.3 111.9 21.1 34.01 0.6
...
400 -14.5 95.6 13.2 40.025
So I want to take the time (abs_t) from DF1 and look up the correct 'a' in DF2.
Something like this (pseudo code):
if DF1['t_abs'] between (DF2['t-start'], DF2['t-end']):
    DF1['a'] = DF2['a']
I could write two for loops, but that looks like the wrong approach and is very, very slow.
I hope you understand my problem; providing a running sample is very hard.
In Excel I did it like this:
I found a very slow solution, but at least it's working:
df1['a'] = 0
for index, row in df2.iterrows():
    start = row['t-start']
    end = row['t-end']
    a = row['a']
    df1.loc[(df1['tabs'] > start) & (df1['tabs'] < end), 'a'] = a
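A vectorized alternative, sketched under the assumption that the [t-start, t-end) intervals in df2 do not overlap: build a pandas IntervalIndex over df2's time ranges and look up every timestamp in one shot.
import pandas as pd

# one interval per df2 row; closed='left' means t-start <= t < t-end
# (assumes the intervals do not overlap, otherwise get_indexer raises)
intervals = pd.IntervalIndex.from_arrays(df2['t-start'], df2['t-end'], closed='left')
pos = intervals.get_indexer(df1['tabs'])   # -1 where no interval contains the value
df1['a'] = pd.Series(df2['a'].to_numpy()[pos], index=df1.index).where(pos >= 0, 0)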

Pandas: rolling windows with a sum product

There are a number of answers that each provide a portion of my desired result, but I am struggling to put them all together. My core pandas dataframe looks like this, where I am trying to estimate volume_step_1:
date volume_step_0 volume_step_1
2018-01-01 100 a
2018-01-02 101 b
2018-01-03 105 c
2018-01-04 123 d
2018-01-05 121 e
I then have a reference table with the conversion rates, for e.g.
step conversion
0 0.60
1 0.81
2 0.18
3 0.99
4 0.75
I have another table containing point estimates of a Poisson distribution:
days_to_complete step_no pc_cases
0 0 0.50
1 0 0.40
2 0 0.07
Using these data, I now want to estimate
volume_step_1 =
(volume_step_0(today) * days_to_complete(step0, day0) * conversion(step0)) +
(volume_step_0(yesterday) * days_to_complete(step0,day1) * conversion(step0))
and so forth.
How do I write some Python code to do so?
Calling your dataframes df1 (the volumes), df2 (the Poisson point estimates), and df3 (the conversion rates); note the .iloc[0] calls, which pull out scalars so that pandas does not try to align the one-row selections against df1's index:
df1['volume_step_1'] = (
    df1['volume_step_0'] *
    df2.loc[(df2['days_to_complete'] == 0) & (df2['step_no'] == 0), 'pc_cases'].iloc[0] *
    df3.loc[df3['step'] == 0, 'conversion'].iloc[0]
    +
    df1['volume_step_0'].shift(1) *
    df2.loc[(df2['days_to_complete'] == 1) & (df2['step_no'] == 0), 'pc_cases'].iloc[0] *
    df3.loc[df3['step'] == 0, 'conversion'].iloc[0]
)
EDIT:
IIUC, you are trying to get a 'dot product' of sorts between the volume_step_0 column and the product of pc_cases and conversion for a particular step_no. You can merge df2 and df3 to match steps:
df_merged = df2.merge(df3, how='left', left_on='step_no', right_on='step')
df_merged.head(3)
   days_to_complete  step_no  pc_cases  step  conversion
0                 0        0      0.50     0         0.6
1                 1        0      0.40     0         0.6
2                 2        0      0.07     0         0.6
I'm guessing you're only using step k to get volume_step_(k+1), and you want to iterate the sum over the days. The following code builds the vector of days_to_complete(step0, day k) * conversion(step0) products for all values of k available in days_to_complete, by taking the row-wise product:
df_fin = df_merged[df_merged['step'] == 0][['conversion', 'pc_cases']].product(axis = 1)
0 0.300
1 0.240
2 0.042
df_fin = df_fin[::-1].reset_index(drop = True)
Finally, take the dot product of that weights vector with volume_step_0 over a rolling window (one window per position where a full window fits), then reverse the column so the values line up as in the output below:
df1['volume_step_1'] = pd.Series([
    df1['volume_step_0'][i:i+len(df_fin)].reset_index(drop=True).dot(df_fin)
    for i in range(len(df1) - len(df_fin) + 1)
])
df1['volume_step_1'] = df1['volume_step_1'][::-1].reset_index(drop=True)
Output:
df1
date volume_step_0 volume_step_1
0 2018-01-01 100 NaN
1 2018-01-02 101 NaN
2 2018-01-03 105 70.230
3 2018-01-04 123 66.342
4 2018-01-05 121 59.940
While this is by no means a comprehensive solution, the code is meant to provide the logic to "sum multiple products", as you had asked.
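The same "sum multiple products" can also be written with pandas' own rolling machinery. A sketch with the weights hard-coded from the sample tables (pc_cases for days 0-2 times the step-0 conversion of 0.60); note that each result here lands on "today's" row, per the formula in the question:
import numpy as np

# weights taken from the sample tables: pc_cases(day k) * conversion(step 0)
w = np.array([0.50, 0.40, 0.07]) * 0.60   # today, yesterday, two days ago
df1['volume_step_1'] = (
    df1['volume_step_0']
    .rolling(window=len(w))
    .apply(lambda x: np.dot(x[::-1], w), raw=True)   # newest value gets w[0]
)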

How can I read in from two files, insert new columns, and compute functions like mean if there are blank values?

I have this file called 'test.txt' and it looks like this:
3.H5 5.40077
2.H8 7.75894
3.H6 7.60437
3.H5 5.40001
5.H5 5.70502
4.H8 7.55438
5.H1' 5.43574
5.H6 7.96472
""
""
""
""
""
""
6.H6 7.96178
6.H5 5.71068
""
""
7.H8 8.29385
7.H1' 6.01136
""
""
""
""
8.H5 5.51053
8.H6 7.67437
I want to see if the values in the first column are the same (i.e.: if 8.H5 occurs more than once), and if they are, I want to count how many times and take their average. I want my output to look like this:
Atom nVa predppm avgppm stdev delta QPred QMulti qTotal
1.H1' 1 5.820 5.737 0.000 0.000 0.985 1.000 0.995
2.H1' 1 5.903 5.892 0.000 0.000 0.998 1.000 0.999
3.H1' 1 5.549 5.454 0.000 0.000 0.983 1.000 0.994
4.H1' 1 5.741 5.737 0.000 0.000 0.999 1.000 1.000
6.H1' 1 5.543 5.600 0.000 0.000 0.990 1.000 0.997
8.H1' 1 5.363 5.359 0.000 0.000 0.999 1.000 1.000
10.H1' 1 5.378 5.408 0.000 0.000 0.995 1.000 0.998
11.H1' 1 5.501 5.497 0.000 0.000 0.999 1.000 1.000
14.H1' 1 5.962 5.893 0.000 0.000 0.988 1.000 0.996
Right now, my code reads from test.txt and computes the count and the mean of the values and gives an output which looks like this (output.txt):
Atom nVa avgppm
1.H1' 1 5.737
2.H1' 1 5.892
3.H1' 1 5.454
4.H1' 1 5.737
6.H1' 1 5.600
But it does not account for the "" rows; how can I get my code to skip lines that have ""?
I also have a file called test2.txt which looks like this:
5.H6 7.72158 0.3
6.H6 7.70272 0.3
7.H8 8.16859 0.3
8.H6 7.65014 0.3
9.H8 8.1053 0.3
10.H6 7.5231 0.3
12.H6 7.72805 0.3
13.H6 8.02977 0.3
14.H6 7.69624 0.3
17.H8 7.24899 0.3
16.H8 8.27957 0.3
18.H6 7.6439 0.3
19.H8 7.65501 0.3
20.H8 7.78512 0.3
21.H8 8.06057 0.3
22.H8 7.47677 0.3
23.H6 7.7306 0.3
24.H6 7.80104 0.3
I want to read in the first column of test.txt and the first column of test2.txt and see if they match (i.e. if 20.H8 = 20.H8); if they do, I want to insert a column into my output.txt between the nVa column and the avgppm column, filled with the values from test2.txt. How can I insert such a column into the output file while also skipping the blank "" lines?
This is my current code:
import pandas as pd
import os
import sys
test = 'test.txt'
test2 = 'test2.txt'
df = pd.read_csv(test, sep = ' ', header = None)
df.columns = ["Atom","ppm"]
gb = (df.groupby("Atom", as_index=False)
.agg({"ppm":["count","mean"]})
.rename(columns={"count":"nVa", "mean":"avgppm"}))
gb.head()
gb.columns = gb.columns.droplevel()
gb = gb.rename(columns={"":"Atom"})
gb.to_csv("output.txt", sep =" ", index=False)
df2 = pd.read_csv(test2, sep = r'\s+', header = None)
df2.columns = ["Atoms","ppms","error"]
shift1 = df2["Atoms"]
shift2 = df2["ppms"]
I'm not exactly sure how to proceed.
To drop the rows with "" values, use the dropna method of the dataframe. You can follow this with reset_index to reset the row count:
df = pd.read_csv(test, sep = ' ', header = None)
df.columns = ["Atom","ppm"]
df = df.dropna().reset_index(drop=True)
gb = ...
To find matching values, you can use merge method and compare the columns of interest.
df2 = pd.read_csv(test2, sep = r'\s+', header = None)
df2.columns = ["Atoms","ppms","error"]
gb.merge(df2, left_on='Atom', right_on='Atoms', how='left').drop(['Atoms','ppms'], axis=1)
This will leave you with NA values if the value in gb is not in df2.
A left merge() should be able to bring df and df2 together the way you want.
df = pd.read_csv("test.txt", sep=" ", header=None, names=["Atom", "ppm"])
df2 = pd.read_csv("test2.txt", sep=" ", header=None, names=["Atom", "ppms", "error"])
gb = df.groupby("Atom").agg(["count", "mean"])
gb.merge(df2.set_index("Atom"), how="left", left_index=True, right_index=True)
(ppm, count) (ppm, mean) ppms error
Atom
2.H8 1 7.75894 NaN NaN
3.H5 2 5.40039 NaN NaN
3.H6 1 7.60437 NaN NaN
4.H8 1 7.55438 NaN NaN
5.H1' 1 5.43574 NaN NaN
5.H5 1 5.70502 NaN NaN
5.H6 1 7.96472 7.72158 0.3
6.H5 1 5.71068 NaN NaN
6.H6 1 7.96178 7.70272 0.3
7.H1' 1 6.01136 NaN NaN
7.H8 1 8.29385 8.16859 0.3
8.H5 1 5.51053 NaN NaN
8.H6 1 7.67437 7.65014 0.3
Note: It doesn't seem that you even need dropna() for the missing rows in df. read_csv() interprets the "" values as NaN, and groupby() ignores NaN when grouping.
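From there, to get the desired column layout (the test2 values between nVa and avgppm), a possible sketch, flattening the column MultiIndex first (the flat names are chosen to match the desired output):
gb.columns = ["nVa", "avgppm"]   # flatten the (ppm, count)/(ppm, mean) columns
merged = gb.merge(df2.set_index("Atom"), how="left", left_index=True, right_index=True)
merged = merged[["nVa", "ppms", "avgppm", "error"]]   # place ppms between nVa and avgppm
merged.to_csv("output.txt", sep=" ")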

Pandas mean() for multiindex

I have df:
CU Parameters 1 2 3
379-H Output Energy, (Wh/h) 0.045 0.055 0.042
349-J Output Energy, (Wh/h) 0.001 0.003 0
625-H Output Energy, (Wh/h) 2.695 1.224 1.272
626-F Output Energy, (Wh/h) 1.381 1.494 1.3
I would like to create two separate dfs, getting the mean of column values by grouping index on level 0 (CU):
df1: (379-H and 625-H)
Parameters 1 2 3
Output Energy, (Wh/h) 1.37 0.63 0.657
df2: (the rest)
Parameters 1 2 3
Output Energy, (Wh/h) 0.69 0.74 0.65
I can get the mean across all CUs by grouping on level 1:
df = df.apply(pd.to_numeric, errors='coerce').dropna(how='all').groupby(level=1).mean()
but how do I group these according to level 0?
SOLUTION:
lightsonly = ["379-H", "625-H"]
df = df.apply(pd.to_numeric, errors='coerce').dropna(how='all')
mask = df.index.get_level_values(0).isin(lightsonly)
df1 = df[mask].groupby(level=1).mean()
df2 = df[~mask].groupby(level=1).mean()
Use get_level_values + isin to build a True/False index, then take the mean and rename via a dict:
d = {True: '379-H and 625-H', False: 'the rest'}
df.index = df.index.get_level_values(0).isin(['379-H', '625-H'])
df = df.mean(level=0).rename(d)
print (df)
1 2 3
the rest 0.691 0.7485 0.650
379-H and 625-H 1.370 0.6395 0.657
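(A side note for recent pandas versions: the level argument to mean() has since been removed; df.groupby(level=0).mean() is the equivalent.)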
For separate dfs it is also possible to use boolean indexing:
mask= df.index.get_level_values(0).isin(['379-H', '625-H'])
df1 = df[mask].mean().rename('379-H and 625-H').to_frame().T
print (df1)
1 2 3
379-H and 625-H 1.37 0.6395 0.657
df2 = df[~mask].mean().rename('the rest').to_frame().T
print (df2)
1 2 3
the rest 0.691 0.7485 0.65
Another numpy solution, using the DataFrame constructor:
a1 = df[mask].values.mean(axis=0)
#alternatively
#a1 = df.values[mask].mean(axis=0)
df1 = pd.DataFrame(a1.reshape(-1, len(a1)), index=['379-H and 625-H'], columns=df.columns)
print (df1)
1 2 3
379-H and 625-H 1.37 0.6395 0.657
Consider the dataframe df where CU and Parameters are assumed to be in the index.
1 2 3
CU Parameters
379-H Output Energy, (Wh/h) 0.045 0.055 0.042
349-J Output Energy, (Wh/h) 0.001 0.003 0.000
625-H Output Energy, (Wh/h) 2.695 1.224 1.272
626-F Output Energy, (Wh/h) 1.381 1.494 1.300
Then we can groupby the truth values of whether the first level values are in the list ['379-H', '625-H'].
m = {True: 'Main', False: 'Rest'}
l = ['379-H', '625-H']
g = df.index.get_level_values('CU').isin(l)
df.groupby(g).mean().rename(index=m)
1 2 3
Rest 0.691 0.7485 0.650
Main 1.370 0.6395 0.657
# Use a lambda function to map the index into two groups, then groupby using the modified index.
df.groupby(by=lambda x:'379-H,625-H' if x[0] in ['379-H','625-H'] else 'Others').mean()
Out[22]:
1 2 3
379-H,625-H 1.370 0.6395 0.657
Others 0.691 0.7485 0.650
