I have two files:
File 1:
key.1 10 6
key.2 5 6
key.3. 5 8
key.4. 5 10
key.5 4 12
File 2:
key.1 10 6
key.2 6 6
key.4 5 10
key.5 2 8
I have a rather complicated issue. I want to average the values from the two files for each ID in the first column. But if an ID is unique to either of the files, I simply want to keep its values in the output file. So the output file would look like this:
key.1 10 6
key.2 5.5 6
key.3. 5 8
key.4. 5 10
key.5 3 10
This is an example. In reality I have 100s of columns that I would like to average.
The following solution uses Pandas, and assumes that your data is stored in plain text files 'file1.txt' and 'file2.txt'. Let me know if this assumption is incorrect - it is likely a minimal edit to alter for different file types. If I have misunderstood your meaning of the word 'file' and your data is already in DataFrames, you can ignore the first step.
First read in the data to DataFrames:
import pandas as pd
df1 = pd.read_table('file1.txt', sep=r'\s+', header=None)
df2 = pd.read_table('file2.txt', sep=r'\s+', header=None)
Giving us:
In [9]: df1
Out[9]:
0 1 2
0 key.1 10 6
1 key.2 5 6
2 key.3 5 8
3 key.4 5 10
4 key.5 4 12
In [10]: df2
Out[10]:
0 1 2
0 key.1 10 6
1 key.2 6 6
2 key.4 5 10
3 key.5 2 8
Then join these datasets on column 0:
combined = pd.merge(df1, df2, how='outer', on=0)
Giving:
0 1_x 2_x 1_y 2_y
0 key.1 10 6 10.0 6.0
1 key.2 5 6 6.0 6.0
2 key.3 5 8 NaN NaN
3 key.4 5 10 5.0 10.0
4 key.5 4 12 2.0 8.0
Which is a bit of a mess, but we can select only the columns we want after doing the calculations:
combined[1] = combined[['1_x', '1_y']].mean(axis=1)
combined[2] = combined[['2_x', '2_y']].mean(axis=1)
Selecting only useful columns:
results = combined[[0, 1, 2]]
Which gives us:
0 1 2
0 key.1 10.0 6.0
1 key.2 5.5 6.0
2 key.3 5.0 8.0
3 key.4 5.0 10.0
4 key.5 3.0 10.0
Which is what you were looking for I believe.
You didn't state which file format you wanted the output to be, but the following will give you a tab-separated text file. Let me know if something different is preferred and I can edit.
results.to_csv('output.txt', sep='\t', header=False, index=False)
I should add that it would be better to give your columns relevant labels rather than using numbers as I have in this example - I just used the default integer values here since I don't know anything about your dataset.
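For example, a minimal tweak using hypothetical names ('key', 'val1', 'val2' - substitute labels that describe your data):
df1 = pd.read_table('file1.txt', sep=r'\s+', header=None, names=['key', 'val1', 'val2'])
df2 = pd.read_table('file2.txt', sep=r'\s+', header=None, names=['key', 'val1', 'val2'])
combined = pd.merge(df1, df2, how='outer', on='key')
The merged columns then come out as 'val1_x', 'val1_y' and so on, which is easier to read than '1_x' and '1_y'.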
This is one solution via pandas. The idea is to define indices for each dataframe and use ^ [equivalent to symmetric_difference in set terminology] to find your unique indices.
Treat each case separately via 2 pd.concat calls, perform a groupby.mean, and append your isolated indices at the end.
# read files into dataframes
df1 = pd.read_csv('file1.csv', header=None)  # add sep=r'\s+' if your files are whitespace-separated as in the question
df2 = pd.read_csv('file2.csv', header=None)
# set first column as index
df1 = df1.set_index(0)
df2 = df2.set_index(0)
# calculate symmetric difference of indices
x = df1.index ^ df2.index  # on newer pandas versions, use df1.index.symmetric_difference(df2.index) instead
# Index(['key.3'], dtype='object', name=0)
# aggregate common and unique indices
df_common = pd.concat((df1[~df1.index.isin(x)], df2[~df2.index.isin(x)]))
df_unique = pd.concat((df1[df1.index.isin(x)], df2[df2.index.isin(x)]))
# calculate mean on common indices; append unique indices
# (note: DataFrame.append was removed in pandas 2.0; use pd.concat there on newer versions)
mean = df_common.groupby(df_common.index)\
                .mean()\
                .append(df_unique)\
                .sort_index()\
                .reset_index()
# output to csv
mean.to_csv('out.csv', index=False)
Result
0 1 2
0 key.1 10.0 6.0
1 key.2 5.5 6.0
2 key.3 5.0 8.0
3 key.4 5.0 10.0
4 key.5 3.0 10.0
You can use itertools.groupby:
import itertools
import re
file_1 = [[re.sub(r'\.$', '', a), *list(map(int, filter(None, b)))] for a, *b in [re.split(r'\s+', i.strip('\n')) for i in open('file1.txt')]]
file_2 = [[re.sub(r'\.$', '', a), *list(map(int, filter(None, b)))] for a, *b in [re.split(r'\s+', i.strip('\n')) for i in open('file2.txt')]]
special_keys = {a for a, *_ in [re.split(r'\s+', i.strip('\n')) for i in open('file1.txt')] + [re.split(r'\s+', i.strip('\n')) for i in open('file2.txt')] if a.endswith('.')}
new_results = [[a, [c for _, *c in b]] for a, b in itertools.groupby(sorted(file_1 + file_2, key=lambda x: x[0]), key=lambda x: x[0])]
last_results = [(" " * 4).join(["{}"] * 3).format(a + '.' if a + '.' in special_keys else a, *[sum(i) / float(len(i)) for i in zip(*b)]) for a, b in new_results]
Output:
['key.1 10.0 6.0', 'key.2 5.5 6.0', 'key.3. 5.0 8.0', 'key.4. 5.0 10.0', 'key.5 3.0 10.0']
One possible solution is to read the two files into dictionaries (key being the key variable, and the value being a list with the two elements after). You can then get the keys of each dictionary, see which keys are duplicated (and if so, average the results), and which keys are unique (and if so, just output the key). This might not be the most efficient, but if you only have hundreds of columns that should be the simplest way to do it.
Look up set intersection and set difference, as they will help you find the common items and unique items.
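A minimal sketch of that dictionary/set idea, assuming the same whitespace-separated 'file1.txt' and 'file2.txt' as above (dictionary key views already behave like sets, so intersection and difference come for free):
def read_file(path):
    # map each key to its list of numeric values
    data = {}
    with open(path) as fh:
        for line in fh:
            key, *values = line.split()
            data[key] = [float(v) for v in values]
    return data

d1 = read_file('file1.txt')
d2 = read_file('file2.txt')

with open('output.txt', 'w') as out:
    for key in sorted(d1.keys() | d2.keys()):
        if key in d1 and key in d2:        # common key: element-wise average
            row = [(x + y) / 2 for x, y in zip(d1[key], d2[key])]
        else:                              # unique key: keep its values unchanged
            row = d1.get(key, d2.get(key))
        out.write(key + ' ' + ' '.join(str(v) for v in row) + '\n')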
I have a dataframe and I would like to add a column based on the values of the other columns
If the problem were only that, I think a good solution would be this answer. However, my problem is a bit more complicated.
Say I have
import pandas as pd
a = pd.DataFrame([[5, 6], [1, 2], [3, 6], [4, 1]], columns=['a', 'b'])
print(a)
I have
a b
0 5 6
1 1 2
2 3 6
3 4 1
Now I want to add a column called 'result' where each of the values would be the result of applying this function
def process(a, b, c, d):
    return {"notthisone": 2*a,
            "thisone": (a*b + c*d),
            }
to each row and the next row of the dataframe.
This function is part of a library; it outputs two values, but we are only interested in the value of the key "thisone".
Also, if possible, we cannot decompose the operations of the function; we have to apply it as-is to the values.
For example, in the first row:
a=5, b=6, c=1, d=2 (c and d being the a and b of the next row), and we want to add the value of "thisone", so 5*6+1*2=32.
In the end I will have
a b result
0 5 6 32
1 1 2 20
2 3 6 22
3 4 1 22 --> This is a special case: since there is no next row, just repeating the previous result would be fine
How can I do this?
I am thinking of traversing the dataframe with a loop but there must be a better and faster way...
EDIT:
I have done this so far
def p4(a, b):
    return {"notthisone": 2*a,
            "thisone": (a*b),
            }
print(a.apply(lambda row: p4(row.a,row.b)["thisone"], axis=1))
and the result is
0 30
1 2
2 18
3 4
dtype: int64
So now I have to think of a way to incorporate next row values too
If you only need the values of the very next row, I think it would be best to shift these values back into the current row (with different column names). Then they can all be accessed by row-wise apply(fn, axis=1).
# library function
def process(a, b, c, d):
    return {
        "notthisone": 2 * a,
        "thisone": (a * b + c * d),
    }
# toy data
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], columns=["a", "b"])
# shift some data back one row
df[["c", "d"]] = df[["a", "b"]].shift(-1)
# apply your function row-wise
df["result"] = df.apply(
lambda x: process(x["a"], x["b"], x["c"], x["d"])["thisone"], axis=1
)
Result:
a b c d result
0 1.0 2.0 3.0 4.0 14.0
1 3.0 4.0 5.0 6.0 42.0
2 5.0 6.0 7.0 8.0 86.0
3 7.0 8.0 NaN NaN NaN
Use the iloc accessor to select the rows, turn them into a numpy array, and compute the product and sum. I used a list comprehension in this case. The last row will be null; forward-fill the resulting column. We could fill at the df level, but that could impact other columns if the df is large and has nulls. Code below.
import numpy as np

a = a.assign(x=pd.Series([np.prod(a.iloc[x].to_numpy()) + np.prod(a.iloc[x+1].to_numpy())
                          for x in np.arange(len(a)) if x != len(a)-1]))
a = a.assign(x=a['x'].ffill())
a b x
0 5 6 32.0
1 1 2 20.0
2 3 6 22.0
3 4 1 22.0
I have a dataframe with a series of numbers. For example:
Index Column 1
1 10
2 12
3 24
4 NaN
5 20
6 15
7 NaN
8 NaN
9 2
I can't use bfill or ffill, as the rule is dynamic: take the last value before a run of NaNs and divide it by the number of consecutive NaNs + 1. For example, rows 3 and 4 should be replaced with 12 (24/2), and rows 6, 7 and 8 should be replaced with 5 (15/3). All other numbers should remain unchanged.
How should I do that?
Note: Edited the dataframe to be more general by inserting a new row between rows 4 and 5 and another row at the end.
You can do:
m = (df["Column 1"].notna()) & (
(df["Column 1"].shift(-1).isna()) | (df["Column 1"].shift().isna())
)
out = df.groupby(m.cumsum()).transform(
lambda x: x.fillna(0).mean() if x.isna().any() else x
)
print(out):
Index Column 1
0 1 10.0
1 2 12.0
2 3 12.0
3 4 12.0
4 5 20.0
5 6 5.0
6 7 5.0
7 8 5.0
8 9 2.0
Explanation and intermediate values:
Basically, look for the rows whose own value is not NaN but where the next value or the previous value is NaN. Those rows form the first row of each such group.
So the m in the above code looks like:
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
8 True
Now I want to form groups of rows of the shape [True, <all False>], because those are the groups I want to take the average of. For that, use cumsum.
If you want to take a look at those groups, you can use ngroup() after groupby on m.cumsum():
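For example, continuing with the m computed above:
print(df.groupby(m.cumsum()).ngroup())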
0 0
1 0
2 1
3 1
4 2
5 3
6 3
7 3
8 4
The above is only to show what the groups are.
Now for each group you can get the mean of the group if the group has any NaN value. This is accomplished by checking for NaNs using x.isna().any().
If the group has any NaN value, then assign the mean after filling NaN with 0; otherwise just keep the group as is. This is accomplished by the lambda:
lambda x: x.fillna(0).mean() if x.isna().any() else x
Why not use interpolate? There is a method= argument that would probably fit your needs.
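For instance, a minimal sketch assuming your column is named 'Column 1' as in the question (note that linear interpolation fills the gaps differently from the divide-by-count rule described above):
df['Column 1'] = df['Column 1'].interpolate(method='linear')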
However, if you really want to do as you described above, you can do something like this. (Note that iterating over rows in pandas is considered bad practice, but it does the job)
import pandas as pd
import numpy as np
df = pd.DataFrame([10, 12, 24, np.nan, 15, np.nan, np.nan])

for col in df:
    for idx in df.index:  # (iterating over rows is considered bad practice)
        local_idx = idx
        while local_idx + 1 < len(df) and np.isnan(df.at[local_idx + 1, col]):
            local_idx += 1
        if (local_idx - idx) > 0:
            fillvalue = df.loc[idx] / (local_idx - idx + 1)
            for fillidx in range(idx, local_idx + 1):
                df.loc[fillidx] = fillvalue
df
Output:
0
0 10.0
1 12.0
2 12.0
3 12.0
4 5.0
5 5.0
6 5.0
Okay this one I'm a little bit stuck on.
I have a dataframe like this:
time Throttle Vout
0 1056.65785 1 8
1 1056.66255 2 8
2 1056.66785 3 9
3 1056.67330 4 11
4 1056.67840 5 15
and I need to add a row between every existing row - the whole dataset is about 21000 rows. The time should be equal to the time in the next row. Any other columns should have the values of the previous row.
So the outcome would be something like this:
time Throttle Vout
0 1056.65785 1 8
1 1056.66255 1 8 <---- new row
2 1056.66255 2 8
3 1056.66785 2 8 <---- new row
4 1056.66785 3 9
5 1056.67330 3 9 <---- new row
6 1056.67330 4 11
7 1056.67840 4 11 <---- new row
8 1056.67840 5 15
I've looked into df.apply() but not sure where to start
Serge Ballesta answer:
So this works with the test data supplied above. When I test it on a much larger DataFrame, I start to see some errors. I originally thought it was something wrong with my PyCharm setup, but testing with a larger dataset in PowerShell proved otherwise.
Quang Hoang answer:
So this also worked on a small scale but when using a larger dataset it seemed to have quite a few issues with both time and the other columns. I've highlighted some in the image below. The top df is the original and the bottom is the altered.
Valdi_Bo answer:
The additional columns seemed to work well with this but there seems to be an issue with the times columns on larger datasets. I've highlighted some below.
You can use a combination of concat and ffill:
(pd.concat([df, df[['time']].shift(-1)])
   .sort_index(kind='mergesort')
   .dropna(how='all')
   .ffill()
)
Output:
time Throttle Vout
0 1056.65785 1.0 8.0
0 1056.66255 1.0 8.0
1 1056.66255 2.0 8.0
1 1056.66785 2.0 8.0
2 1056.66785 3.0 9.0
2 1056.67330 3.0 9.0
3 1056.67330 4.0 11.0
3 1056.67840 4.0 11.0
4 1056.67840 5.0 15.0
I would build a copy of the dataframe, shift its time column, concatenate it to the original dataframe and sort the result according to time:
df2 = df.copy()
df2['time'] = df['time'].shift(-1)
result = df2[~df2['time'].isna()].append(df).sort_values('time').reset_index(drop=True)
It gives as expected:
time Throttle Vout
0 1056.65785 1 8
1 1056.66255 1 8
2 1056.66255 2 8
3 1056.66785 2 8
4 1056.66785 3 9
5 1056.67330 3 9
6 1056.67330 4 11
7 1056.67840 4 11
8 1056.67840 5 15
This might look a bit overwhelming, but the idea is that you merge the original dataframe with its copy whose values in Throttle & Vout columns are shifted by 1:
pd.concat([
    df,
    df.loc[:, 'Throttle':].shift(1).combine_first(df)
]).reset_index().loc[1:,].sort_values(['time', 'Throttle'])
First compute an auxiliary DataFrame - a copy of df, with time column
shifted 1 place up and without the last original row:
df2 = df.copy()
df2.time = df2.time.shift(-1)
df2.dropna(inplace=True)
The result, for your input sample, is:
time Throttle Vout
0 1056.66255 1 8
1 1056.66785 2 8
2 1056.67330 3 9
3 1056.67840 4 11
and these are the new rows to insert.
To get a concatenation of these 2 DataFrames, in proper order, run:
df = pd.concat([df, df2], keys=[1, 2]).swaplevel().sort_index().reset_index(drop=True)
To guarantee the proper order of rows, I added to the previous solution:
keys - to add "origin indicators", but they are added as the top level of the MultiIndex,
swaplevel - to swap the levels of the MultiIndex, to provide the proper sort by index (in the next step).
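For illustration (a sketch - the exact repr can differ between pandas versions), the intermediate index right before reset_index pairs each original row with its generated counterpart, with the last original row left unpaired at the end:
pd.concat([df, df2], keys=[1, 2]).swaplevel().sort_index().index
# roughly: MultiIndex([(0, 1), (0, 2), (1, 1), (1, 2), ..., (4, 1)])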
The example below:
import pandas as pd
list1 = ['a','a','a','b','b','b','b','c','c','c']
list2 = range(len(list1))
df = pd.DataFrame(zip(list1, list2), columns=['Item', 'Value'])
df
gives:
Item Value
0 a 0
1 a 1
2 a 2
3 b 3
4 b 4
5 b 5
6 b 6
7 c 7
8 c 8
9 c 9
Required: a GroupFirstValue column as shown below.
The idea is to use a lambda formula to get the 'first' value for each group. For example, "a"'s first value is 0, "b"'s first value is 3, "c"'s first value is 7. That's why those numbers appear in the GroupFirstValue column.
Note: I know that I can do this on 2 steps...one is the original df and the second is a grouped by df and then merge them together. The idea is to see if this can be done more efficiently in a single step. Many thanks in advance!
groupby and use first
df.groupby('Item')['Value'].first()
or you can use transform and assign to a new column in your frame
df['new_col'] = df.groupby('Item')['Value'].transform('first')
Use mask and duplicated
df['GroupFirstValue'] = df.Value.mask(df.Item.duplicated())
Out[109]:
Item Value GroupFirstValue
0 a 0 0.0
1 a 1 NaN
2 a 2 NaN
3 b 3 3.0
4 b 4 NaN
5 b 5 NaN
6 b 6 NaN
7 c 7 7.0
8 c 8 NaN
9 c 9 NaN
I have a 21840x39 data frame. A few of my columns are numerically valued and I want to make sure they are all of the same data type (which I want to be float).
Instead of writing all the column names out and converting them:
df[['A', 'B', 'C', ...]] = df[['A', 'B', 'C', ...]].astype(float)
Can I do a for loop that will allow me to say something like "convert to float from column 18 to column 35"?
I know how to do one column: df['A'] = df['A'].astype(float)
But how can I do multiple columns? I tried with list slicing within a loop but couldn't get it right.
The first idea is to convert the selected columns; Python counts from 0, so for columns 18 to 35 use:
df.iloc[:, 17:35] = df.iloc[:, 17:35].astype(float)
If that is not working (because of a possible bug), use another solution:
df = df.astype(dict.fromkeys(df.columns[17:35], float))
Sample - convert the 8th to 15th columns:
import pandas as pd
import numpy as np

np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(3, 18)),
                  columns=list('abcdefghijklmnopqr')).astype(str)
print (df)
a b c d e f g h i j k l m n o p q r
0 0 8 3 6 3 3 7 8 0 0 8 9 3 7 2 3 6 5
1 0 4 8 6 4 1 1 5 9 5 6 6 6 5 4 6 4 2
2 3 4 7 1 4 9 3 2 0 9 1 2 7 1 0 2 8 8
df = df.astype(dict.fromkeys(df.columns[7:15], float))
print (df)
a b c d e f g h i j k l m n o p q r
0 0 8 3 6 3 3 7 8.0 0.0 0.0 8.0 9.0 3.0 7.0 2.0 3 6 5
1 0 4 8 6 4 1 1 5.0 9.0 5.0 6.0 6.0 6.0 5.0 4.0 6 4 2
2 3 4 7 1 4 9 3 2.0 0.0 9.0 1.0 2.0 7.0 1.0 0.0 2 8 8
Tweaked @jezrael's code, as typing in column names is (I feel) a good option.
import pandas as pd
import numpy as np
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(3, 18)),
                  columns=list('abcdefghijklmnopqr')).astype(str)
print(df)
columns = list(df.columns)
#change the first and last column names below as required
df = df.astype(dict.fromkeys(
    df.columns[columns.index('h'):(columns.index('o')+1)], float))
print (df)
Leaving the original answer below here but note: Never loop in pandas if vectorized alternatives exist
If I had a dataframe and wanted to change columns 'col3' to 'col5' (human readable names) to floats I could...
import pandas as pd

df = pd.read_csv('dummy_data.csv')
df
columns = list(df.columns)
#change the first and last column names below as required
start_column = columns.index('col3')
end_column = columns.index('col5')
for index, col in enumerate(columns):
    if (start_column <= index) & (index <= end_column):
        df[col] = df[col].astype(float)
df
...by just changing the column names. Perhaps it's easier to work with column names and specify 'from this one' to 'that one' (inclusive).
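A minimal sketch of that label-based idea (using the 'col3' and 'col5' names from above; .loc slicing by label is inclusive at both ends, so no +1 is needed):
cols = df.loc[:, 'col3':'col5'].columns
df[cols] = df[cols].astype(float)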