Pandas integrate over columns per each row - python

In a simplified dataframe:
import pandas as pd
df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
'351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
'352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
'353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
'354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
'355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
index=[1, 2, 3, 4, 5])
int_range = df1.columns.astype(float)
a = 0.005
b = 0.837
I would like to solve an equation which is attached as an image below:
I is equal to the values in the data frame. x is the int_range values so in this case from 350 to 355 with a dx=1.
a and b are optional constants
I need to get a dataframe as an output per each row
For now I do something like this, but I'm not sure it's correct:
dict_INT = {}
for index, row in df1.iterrows():
func = df1.loc[index]*df1.loc[index].index.astype('float')
x = df1.loc[index].index.astype('float')
dict_INT[index] = integrate.trapz(func, x)
df_out = pd.DataFrame(dict_INT, index=['INT']).T
df_fin = df_out/(a*b)
This is the final sum I get per row:
1 3.505796e+06
2 3.068796e+06
3 2.700446e+06
4 2.199336e+06
5 1.840992e+06

I solved this by first converting the dataframe to dict and then performing your equation by each item in row, then writing these value to dict using collections defaultdict. I will break it down:
import pandas as pd
from collections import defaultdict
df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
'351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
'352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
'353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
'354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
'355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
index=[1, 2, 3, 4, 5]
)
int_range = df1.columns.astype(float)
a = 0.005
b = 0.837
dx = 1
df_dict = df1.to_dict() # convert df to dict for easier operations
integrated_dict = {} # initialize empty dict
d = defaultdict(list) # initialize empty dict of lists for tuples later
integrated_list = []
for k,v in df_dict.items(): # unpack df dict of dicts
for x,y in v.items(): # unpack dicts by column and index (x is index, y is column)
integrated_list.append((k, (((float(k)*float(y)*float(dx))/(a*b))))) #store a list of tuples.
for x,y in integrated_list: # create dict with column header as key and new integrated calc as value (currently a tuple)
d[x].append(y)
d = {k:tuple(v) for k, v in d.items()} # unpack to multiple values
integrated_df = pd.DataFrame.from_dict(d) # to df
integrated_df['Sum'] = integrated_df.iloc[:, :].sum(axis=1)
output (updated to include sum):
350 351 352 353 354 \
0 660539.653524 678928.103226 697410.576822 710302.382557 722004.527599
1 578070.704898 594694.141935 611402.972521 622015.269056 631317.086738
2 505890.250896 521785.529032 537763.142652 547984.294624 556969.473835
3 418189.952210 432314.245161 446512.126165 455795.202628 464025.483871
4 340576.344086 353243.212903 365976.797133 374493.356033 382109.376344
355 Sum
0 733761.502987 4.202947e+06
1 640661.416965 3.678162e+06
2 565996.646356 3.236389e+06
3 383188.781362 2.600026e+06
4 389762.516129 2.206162e+06

Related

How to assign dynamic variables calling from a function in python

I have a function which does a bunch of stuff and returns pandas dataframes. The dataframe is extracted from a dynamic list and hence I'm using the below method to return these dataframes.
As soon as I call the function (code in 2nd block), my jupyter notebook just runs the cell infinitely like some infinity loop. Any idea how I can do this more efficiently.
funct(x):
some code which creates multiple dataframes
i = 0
for k in range(len(dynamic_list)):
i += 1
return globals()["df" + str(i)]
The next thing I do is call the function and try to assign it dynamically,
i = 0
for k in range(len(dynamic_list)):
i += 1
globals()["new_df" + str(i)] = funct(x)
I have tried returning selective dataframes from first function and it works just fine, like,
funct(x):
some code returning df1, df2, df3....., df_n
return df1, df2
new_df1, new_df2 = funct(x)
for each dataframe object your code is creating you can simply add it to a dictionary and set the key from your dynamic list.
Here is a simple example:
import pandas as pd
test_data = {"key1":[1, 2, 3], "key2":[1, 2, 3], "key3":[1, 2, 3]}
df = pd.DataFrame.from_dict(test_data)
dataframe example:
key1 key2 key3
0 1 1 1
1 2 2 2
2 3 3 3
I have used a fixed list of values to focus on but this can be dynamic based on however you are creating them.
values_of_interest_list = [1, 3]
Now we can do whatever we want to do with the dataframe, in this instance, I want to filter only data where we have a value from our list.
data_dict = {}
for value_of_interest in values_of_interest_list:
x_df = df[df["key1"] == value_of_interest]
data_dict[value_of_interest] = x_df
To see what we have, we can print out the created dictionary that contains the key we have assigned and the associated dataframe object.
for key, value in data_dict.items():
print(type(key))
print(type(value))
Which returns
<class 'int'>
<class 'pandas.core.frame.DataFrame'>
<class 'int'>
<class 'pandas.core.frame.DataFrame'>
Full sample code is below:
import pandas as pd
test_data = {"key1":[1, 2, 3], "key2":[1, 2, 3], "key3":[1, 2, 3]}
df = pd.DataFrame.from_dict(test_data)
values_of_interest_list = [1, 3]
# Dictionary for data
data_dict = {}
# Loop though the values of interest
for value_of_interest in values_of_interest_list:
x_df = df[df["key1"] == value_of_interest]
data_dict[value_of_interest] = x_df
for key, value in data_dict.items():
print(type(key))
print(type(value))

Get dict keys using pandas apply

i want to get values from the dict that looks like
pair_devices_count =
{('tWAAAA.jg', 'ttNggB.jg'): 1,
('tWAAAM.jg', 'ttWVsM.jg'): 2,
('tWAAAN.CV', 'ttNggB.AS'): 1,
('tWAAAN.CV', 'ttNggB.CV'): 2,
('tWAAAN.CV', 'ttNggB.QG'): 1}
(Pairs of domain)
But when i use
train_data[['domain', 'target_domain']].apply(lambda x: pair_devices_count.get((x), 0))
it raises an error, because pandas series are not hashable
How can i get dict values to generate column
train['pair_devices_count']?
you cannot apply on multiple columns. You can try this :
train_data.apply(lambda x: pair_devices_count[(x.domain, x.target_domain)], axis=1)
pandas series are not hashable
Convert pd.Series to tuple before using .get consider following simple example
import pandas as pd
d = {('A','A'):1,('A','B'):2,('A','C'):3}
df = pd.DataFrame({'X':['A','A','A'],'Y':['C','B','A'],'Z':['X','Y','Z']})
df['d'] = df[['X','Y']].apply(lambda x:d.get(tuple(x)),axis=1)
print(df)
output
X Y Z d
0 A C X 3
1 A B Y 2
2 A A Z 1

How to access elements of Collections Counter stored as column in dataframe to be used in CountVectorizer

One of the columns in the dataframe is in the following format
Row 1 :
Counter({'First': 3, 'record': 2})
Row 2 :
Counter({'Second': 2, 'record': 1}).
I want to create a new column which has the following value:
Row 1 :
First First First record record
Row 2 :
Second Second record
I was able to solve the question myself by the following code. It is very much related to regex.
def transform_word_count(text):
words = re.findall(r'\'(.+?)\'',text)
n = re.findall(r"[0-9]",text)
result = []
for i in range(len(words)):
for j in range(int(n[i])):
result.append(words[i])
return result
df['new'] = df.apply(lambda row: transform_word_count(row['old']), axis=1)
Use apply with iter values of counter and join with space - first repeated values and then together:
import ast
#convert values to dictionaries
df['col'] = df['col'].str.extract('\((.+)\)', expand=False).apply(ast.literal_eval)
df['new'] = df['col'].apply(lambda x: ' '.join(' '.join([k] * v) for k, v in x.items()))
print (df)
col new
0 {'First': 3, 'record': 2} First First First record record
1 {'Second': 2, 'record': 1} Second Second record
Or list comprehension:
df['new'] = [' '.join(' '.join([k] * v) for k, v in x.items()) for x in df['col']]

Get values from one dataframe where they are between an interval in another dataframe

Given a dataframe containing a numeric (float) series and a categorical ID (df). How can I create a dictionary in the form 'key': [] where the key is an ID from the dataframe and the list contains the difference between the numbers in the separate dataframes?
I have managed this using loops though I am looking for a more pandas way of doing this.
import pandas as pd
from collections import defaultdict
df = pd.DataFrame({'a': [0.75435, 0.74897, 0.60949,
0.87438, 0.90885, 0.28547,
0.27327, 0.31078, 0.15576,
0.58139],
'id': list('aaaxxbbyyy')})
rl = pd.DataFrame({'b': [0.51, 0.30], 'id': ['aaa', 'bbb']})
interval = 0.1
d = defaultdict(list)
for index, row in rl.iterrows():
before = df[df['a'].between(row['b'] - interval, row['b'], inclusive=False)]
after = df[df['a'].between(row['b'], row['b'] + interval, inclusive=True)]
for x, b_row in before.iterrows():
d[b_row['id']].append((b_row['a'] - row['b']))
for x, a_row in after.iterrows():
d[a_row['id']].append((a_row['a'] - row['b']))
for k, v in d.items():
print('{k}\t{v}'.format(k=k, v=len(v)))
a 1
y 2
b 2
d
defaultdict(list,
{'a': [0.09948],
'b': [-0.01452, -0.02672],
'y': [0.07138, 0.01078]})

How to replace elements of a DataFrame from other indicated columns

I have a DataFrame like:
df = pd.DataFrame([{'v1':'a', 'v2':'b', 'v3':'1'},
{'v1':'2', 'v2':'c', 'v3':'d'}])
or
v1 v2 v3
0 a b 1
1 2 c d
When the contents of a column/row is '1', '2' or '3', I would like to replace its contents with the corresponding item from the column indicated. I.e., in the first row, column v3 has value "1" so I would like to replace it with the value of the first element in column v1. Doing this for both rows, I should get:
v1 v2 v3
0 a b a
1 c c d
I can do this with the following code:
for i in range(3):
for j in range(3):
df.loc[df['v%d' % (i+1)]==('%d' % (j+1)),'v%d' % (i+1)]= \
df.loc[df['v%d' % (i+1)]==('%d' % (j+1)),'v%d' % (j+1)]
Is there a less cumbersome way to do this?
df.apply(lambda row: [row['v'+v] if 'v'+v in row else v for v in row], 1)
This iterates over each row and replaces any value v with the value in column named 'v'+v if that column exists, otherwise it does not change the value.
output:
v1 v2 v3
0 a b a
1 c c d
Note that this will not limit the replacements to digits only. For example, if you have a column named 'va', it will replace all cells that contain 'a' with the value in the 'va' column in a that row. To limit the rows that you can replace from, you can define a list of acceptable column names. For example, lets say you only wanted to make replacements from column v1:
acceptable_columns = ['v1']
df.apply(lambda row: [row['v'+v] if 'v'+v in acceptable_columns else v for v in row], 1)
output:
v1 v2 v3
0 a b a
1 2 c d
EDIT
It was pointed out that the answer above throws an error if you have non-string types in your dataframe. You can avoid this by explicitly converting each cell value to a string:
df.apply(lambda row: [row['v'+str(v)] if 'v'+str(v) in row else v for v in row], 1)
ORIGINAL (INCORRECT) ANSWER BELOW
note that the answer below only applies when the values to replace are on a diagonal (which is the case in the example but that was not the question asked ... my bad)
You can do this with pandas' replace method and numpy's diag method:
First select the values to replace, these will be the digits 1 to the length of your dataframe:
to_replace = [str(i) for i in range(1,len(df)+1)]
Then select values that each should be replaced with, these will be the diagonal of your data frame:
import numpy as np
replace_with = np.diag(df)
Now you can do the actual replacement:
df.replace(to_replace, replace_with)
which gives:
v1 v2 v3
0 a b a
1 c c d
And of course if you want the whole thing as a one liner:
df.replace([str(i) for i in range(1,len(df)+1)], np.diag(df))
Add the inplace=True keyword arg to replace if you want to do the replacement in place.
I see 2 options.
Loop over the columns and then over the mapping
mapping = {'1': 'v1', '3': 'v3', '2': 'v2'}
df1 = df.copy()
for column_name, column in df1.iteritems():
for k, v in mapping.items():
df1.loc[column == k, column_name] = df1.loc[column == k, v]
df1
v1 v2 v3
0 a b a
1 c c d
Loop over the columns, then loop over all the 'hits'
df2 = df.copy()
for column_name, column in df2.iteritems():
hits = column.isin(mapping.keys())
for idx, item in column[hits].iteritems():
df2.loc[idx, column_name] = df2.loc[idx, mapping[item]]
df2
v1 v2 v3
0 a b a
1 c c d
If you've chosen a way, you could reduce the 2 nested for-loops to 1 for-loop with itertools.product
I made this:
df = pd.DataFrame([{'v1':'a', 'v2':'b', 'v3':'1'},
{'v1':'2', 'v2':'c', 'v3':'d'}])
def replace_col(row, columns, col_num_dict={1: 'v1', 2: 'v2', 3: 'v3'}):
for col in columns:
x = getattr(row, col)
try:
x = int(x)
if int(x) in col_num_dict.keys():
setattr(row, col, getattr(row, col_num_dict[int(x)]))
except ValueError:
pass
return row
df = df.apply(replace_col, axis=1, args=(df.columns,))
It applies the function replace_col on every row. The row object's attributes which correspond to its columns get replaced with the right value from the same row. It looks a bit complicated due to the multiple set/get attribute functions, but it does exactly what is needed without too much overhead.
you can modify the data before convert to df
data = [{'v1':'a', 'v2':'b', 'v3':'1'},{'v1':'2', 'v2':'c', 'v3':'d'}]
mapping = {'1': 'v1', '3': 'v3', '2': 'v2'}
for idx,line in enumerate(data):
... for item in line:
... try:
... int(line[item ])
... data[idx][item ] = data[idx][mapping[line[item ]]]
... except Exception:
... pass
[{'v1': 'a', 'v2': 'b', 'v3': 'a'}, {'v1': 'c', 'v2': 'c', 'v3': 'd'}]

Categories

Resources