Pandas integrate over columns per each row

In a simplified dataframe:
import pandas as pd
df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
'351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
'352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
'353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
'354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
'355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
index=[1, 2, 3, 4, 5])
int_range = df1.columns.astype(float)
a = 0.005
b = 0.837
I would like to solve an equation (attached as an image in the original post, not reproduced here).
I is the values in the dataframe, and x is the int_range values, in this case 350 to 355 with dx=1.
a and b are optional constants.
I need a dataframe as output, with one result per row.
For now I do something like this, but I'm not sure it's correct:
from scipy import integrate
dict_INT = {}
for index, row in df1.iterrows():
    x = row.index.astype('float')
    func = row * x
    # trapezoid is the current name of trapz, which newer SciPy releases removed
    dict_INT[index] = integrate.trapezoid(func, x)
df_out = pd.DataFrame(dict_INT, index=['INT']).T
df_fin = df_out/(a*b)
This is the final sum I get per row:
1 3.505796e+06
2 3.068796e+06
3 2.700446e+06
4 2.199336e+06
5 1.840992e+06

I solved this by first converting the dataframe to a dict, then evaluating your equation for each item in each row, and collecting the results in a collections.defaultdict. I will break it down:
import pandas as pd
from collections import defaultdict
df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
'351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
'352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
'353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
'354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
'355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
index=[1, 2, 3, 4, 5]
)
int_range = df1.columns.astype(float)
a = 0.005
b = 0.837
dx = 1
df_dict = df1.to_dict() # convert df to a dict of dicts: {column: {row index: value}}
d = defaultdict(list) # dict of lists, one list per column
integrated_list = []
for k, v in df_dict.items(): # k is the column label, v maps row index to value
    for x, y in v.items(): # x is the row index, y is the cell value
        integrated_list.append((k, (float(k)*float(y)*float(dx))/(a*b))) # store (column, term) tuples
for x, y in integrated_list: # group the computed terms by column label
    d[x].append(y)
d = {k: tuple(v) for k, v in d.items()} # one tuple of terms per column
integrated_df = pd.DataFrame.from_dict(d) # to df
integrated_df['Sum'] = integrated_df.sum(axis=1)
output (updated to include sum):
             350            351            352            353            354            355           Sum
0  660539.653524  678928.103226  697410.576822  710302.382557  722004.527599  733761.502987  4.202947e+06
1  578070.704898  594694.141935  611402.972521  622015.269056  631317.086738  640661.416965  3.678162e+06
2  505890.250896  521785.529032  537763.142652  547984.294624  556969.473835  565996.646356  3.236389e+06
3  418189.952210  432314.245161  446512.126165  455795.202628  464025.483871  383188.781362  2.600026e+06
4  340576.344086  353243.212903  365976.797133  374493.356033  382109.376344  389762.516129  2.206162e+06
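For comparison, both calculations can be written without explicit loops. The following is a minimal sketch rather than either poster's code: it assumes the integrand is I·x (as in the question's attempt), writes the trapezoidal rule out in NumPy so that it does not depend on a particular SciPy or NumPy version, and uses the hypothetical names trapz_result and rect_result.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
                    '351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
                    '352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
                    '353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
                    '354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
                    '355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
                   index=[1, 2, 3, 4, 5])
a = 0.005
b = 0.837

x = df1.columns.astype(float).to_numpy()  # 350.0 ... 355.0
y = df1.to_numpy() * x                    # assumed integrand I * x, one row per sample

# Trapezoidal rule along each row: mean of neighbouring points times the step.
trapz_result = pd.Series(
    ((y[:, 1:] + y[:, :-1]) / 2 * np.diff(x)).sum(axis=1) / (a * b),
    index=df1.index, name='INT')

# Plain rectangle sum with dx = 1, which is what the dict-based answer adds up.
dx = 1
rect_result = pd.Series(y.sum(axis=1) * dx / (a * b), index=df1.index, name='Sum')
```

The trapezoidal result reproduces the question's per-row values (about 3.5058e+06 for the first row), while the rectangle sum reproduces the Sum column above (about 4.2029e+06). The two differ because the trapezoidal rule gives the two end points half weight.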

Related

How to assign dynamic variables calling from a function in python

I have a function which does a bunch of stuff and returns pandas dataframes. The dataframe is extracted from a dynamic list and hence I'm using the below method to return these dataframes.
As soon as I call the function (code in the 2nd block), my Jupyter notebook runs the cell indefinitely, as if stuck in an infinite loop. Any idea how I can do this more efficiently?
def funct(x):
    # some code which creates multiple dataframes
    i = 0
    for k in range(len(dynamic_list)):
        i += 1
        return globals()["df" + str(i)]
The next thing I do is call the function and try to assign it dynamically,
i = 0
for k in range(len(dynamic_list)):
    i += 1
    globals()["new_df" + str(i)] = funct(x)
I have tried returning selective dataframes from first function and it works just fine, like,
def funct(x):
    # some code returning df1, df2, df3....., df_n
    return df1, df2
new_df1, new_df2 = funct(x)

For each dataframe object your code creates, you can simply add it to a dictionary and set the key from your dynamic list.
Here is a simple example:
import pandas as pd
test_data = {"key1":[1, 2, 3], "key2":[1, 2, 3], "key3":[1, 2, 3]}
df = pd.DataFrame.from_dict(test_data)
dataframe example:
key1 key2 key3
0 1 1 1
1 2 2 2
2 3 3 3
I have used a fixed list of values to focus on but this can be dynamic based on however you are creating them.
values_of_interest_list = [1, 3]
Now we can do whatever we want to do with the dataframe, in this instance, I want to filter only data where we have a value from our list.
data_dict = {}
for value_of_interest in values_of_interest_list:
    x_df = df[df["key1"] == value_of_interest]
    data_dict[value_of_interest] = x_df
To see what we have, we can print out the created dictionary that contains the key we have assigned and the associated dataframe object.
for key, value in data_dict.items():
    print(type(key))
    print(type(value))
Which returns
<class 'int'>
<class 'pandas.core.frame.DataFrame'>
<class 'int'>
<class 'pandas.core.frame.DataFrame'>
Full sample code is below:
import pandas as pd
test_data = {"key1":[1, 2, 3], "key2":[1, 2, 3], "key3":[1, 2, 3]}
df = pd.DataFrame.from_dict(test_data)
values_of_interest_list = [1, 3]
# Dictionary for data
data_dict = {}
# Loop through the values of interest
for value_of_interest in values_of_interest_list:
    x_df = df[df["key1"] == value_of_interest]
    data_dict[value_of_interest] = x_df
for key, value in data_dict.items():
    print(type(key))
    print(type(value))
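Applied to the question, the same idea means the function returns one dictionary instead of writing names into globals(). A minimal sketch, assuming a hypothetical helper called split_by_value and the sample frame from this answer:

```python
import pandas as pd

def split_by_value(df, column, values):
    """Return {value: filtered DataFrame} instead of creating df1, df2, ... globals."""
    return {value: df[df[column] == value] for value in values}

test_data = {"key1": [1, 2, 3], "key2": [1, 2, 3], "key3": [1, 2, 3]}
df = pd.DataFrame.from_dict(test_data)

# The caller receives every frame at once and looks each one up by its key.
frames = split_by_value(df, "key1", [1, 3])
new_df = frames[3]
```

Because the keys come from the dynamic list, the caller never needs to know in advance how many dataframes there are.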

Get dict keys using pandas apply

I want to get values from a dict that looks like this:
pair_devices_count =
{('tWAAAA.jg', 'ttNggB.jg'): 1,
('tWAAAM.jg', 'ttWVsM.jg'): 2,
('tWAAAN.CV', 'ttNggB.AS'): 1,
('tWAAAN.CV', 'ttNggB.CV'): 2,
('tWAAAN.CV', 'ttNggB.QG'): 1}
(pairs of domains)
But when I use
train_data[['domain', 'target_domain']].apply(lambda x: pair_devices_count.get((x), 0))
it raises an error, because pandas Series are not hashable.
How can I use the dict values to generate the column train['pair_devices_count']?

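By default, apply on a DataFrame passes each column to the function, so the lambda receives a whole Series rather than one row's pair. Pass axis=1 to work row by row, and keep .get with a default so that missing pairs give 0 instead of a KeyError:
train_data.apply(lambda x: pair_devices_count.get((x.domain, x.target_domain), 0), axis=1)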

As the error says, pandas Series are not hashable.
Convert each row (a pd.Series) to a tuple before calling .get. Consider the following simple example:
import pandas as pd
d = {('A','A'):1,('A','B'):2,('A','C'):3}
df = pd.DataFrame({'X':['A','A','A'],'Y':['C','B','A'],'Z':['X','Y','Z']})
df['d'] = df[['X','Y']].apply(lambda x:d.get(tuple(x)),axis=1)
print(df)
output
X Y Z d
0 A C X 3
1 A B Y 2
2 A A Z 1
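The lookup can also be done without apply by zipping the two key columns into tuples. This is a minimal sketch on made-up data; pair_counts and the column names X and Y are placeholders for pair_devices_count, domain and target_domain from the question.

```python
import pandas as pd

# Hypothetical stand-in for pair_devices_count.
pair_counts = {('A', 'A'): 1, ('A', 'B'): 2}
df = pd.DataFrame({'X': ['A', 'A', 'B'], 'Y': ['B', 'A', 'A']})

# zip yields one plain (hashable) tuple per row; pairs not in the dict get 0.
df['count'] = [pair_counts.get(pair, 0) for pair in zip(df['X'], df['Y'])]
print(df)
```

The third row's pair ('B', 'A') is not in the dict, so it receives the default 0.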

How to access elements of Collections Counter stored as column in dataframe to be used in CountVectorizer

One of the columns in the dataframe is in the following format
Row 1 :
Counter({'First': 3, 'record': 2})
Row 2 :
Counter({'Second': 2, 'record': 1}).
I want to create a new column which has the following value:
Row 1 :
First First First record record
Row 2 :
Second Second record

I was able to solve the question myself with the following code, which relies on regular expressions.
import re
def transform_word_count(text):
    words = re.findall(r'\'(.+?)\'', text)  # the quoted words
    n = re.findall(r":\s*(\d+)", text)  # the count after each colon; \d+ also handles counts above 9
    result = []
    for i in range(len(words)):
        for j in range(int(n[i])):
            result.append(words[i])
    return ' '.join(result)  # space-separated string, as asked in the question
df['new'] = df.apply(lambda row: transform_word_count(row['old']), axis=1)

Use apply to iterate over the items of each counter: repeat each key by its count, then join everything with spaces:
import ast
# convert the string values to dictionaries
df['col'] = df['col'].str.extract(r'\((.+)\)', expand=False).apply(ast.literal_eval)
df['new'] = df['col'].apply(lambda x: ' '.join(' '.join([k] * v) for k, v in x.items()))
print (df)
col new
0 {'First': 3, 'record': 2} First First First record record
1 {'Second': 2, 'record': 1} Second Second record
Or list comprehension:
df['new'] = [' '.join(' '.join([k] * v) for k, v in x.items()) for x in df['col']]
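If the column holds real Counter objects rather than their string representation, no parsing is needed. A minimal sketch, assuming the column is named col as in the answer above; Counter.elements() yields each key repeated by its count.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'col': [Counter({'First': 3, 'record': 2}),
                           Counter({'Second': 2, 'record': 1})]})

# elements() repeats every key count times, in insertion order.
df['new'] = df['col'].apply(lambda c: ' '.join(c.elements()))
print(df['new'].tolist())
```

The resulting strings can be passed to CountVectorizer like any other text column.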

Get values from one dataframe where they are between an interval in another dataframe

Given a dataframe (df) containing a numeric (float) series and a categorical ID, and a second dataframe (rl) of reference values, how can I create a dictionary of the form 'key': [], where each key is an ID from df and the list contains the differences between the values in df and the reference values in rl that lie within a given interval of them?
I have managed this using loops, though I am looking for a more pandas way of doing this.
import pandas as pd
from collections import defaultdict
df = pd.DataFrame({'a': [0.75435, 0.74897, 0.60949,
0.87438, 0.90885, 0.28547,
0.27327, 0.31078, 0.15576,
0.58139],
'id': list('aaaxxbbyyy')})
rl = pd.DataFrame({'b': [0.51, 0.30], 'id': ['aaa', 'bbb']})
interval = 0.1
d = defaultdict(list)
for index, row in rl.iterrows():
    # inclusive takes 'both', 'neither', 'left' or 'right'; booleans were removed in pandas 2.0
    before = df[df['a'].between(row['b'] - interval, row['b'], inclusive='neither')]
    after = df[df['a'].between(row['b'], row['b'] + interval, inclusive='both')]
    for x, b_row in before.iterrows():
        d[b_row['id']].append((b_row['a'] - row['b']))
    for x, a_row in after.iterrows():
        d[a_row['id']].append((a_row['a'] - row['b']))
for k, v in d.items():
    print('{k}\t{v}'.format(k=k, v=len(v)))
a 1
y 2
b 2
d
defaultdict(list,
{'a': [0.09948],
'b': [-0.01452, -0.02672],
'y': [0.07138, 0.01078]})
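One possible loop-free version is sketched below. It is an assumption about what "a more pandas way" could look like, not an answer from the original thread: it pairs every value in df with every reference value in rl through a cross merge (available in pandas 1.2 and later) and keeps the pairs whose difference falls in the half-open window (-interval, interval], which is what the two between calls cover together.

```python
import pandas as pd

df = pd.DataFrame({'a': [0.75435, 0.74897, 0.60949,
                         0.87438, 0.90885, 0.28547,
                         0.27327, 0.31078, 0.15576,
                         0.58139],
                   'id': list('aaaxxbbyyy')})
rl = pd.DataFrame({'b': [0.51, 0.30], 'id': ['aaa', 'bbb']})
interval = 0.1

# Every (a, b) combination; only column b is taken from rl to avoid an id clash.
pairs = df.merge(rl[['b']], how='cross')
pairs['diff'] = pairs['a'] - pairs['b']

# (b - interval, b) exclusive plus [b, b + interval] inclusive.
hits = pairs[(pairs['diff'] > -interval) & (pairs['diff'] <= interval)]

d = hits.groupby('id')['diff'].apply(list).to_dict()
counts = {k: len(v) for k, v in d.items()}
print(counts)
```

The lists hold the same differences as the loop version, although their order within each list can differ, and values that sit exactly on a window edge may be classified differently because of floating-point rounding.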

How to replace elements of a DataFrame from other indicated columns

I have a DataFrame like:
df = pd.DataFrame([{'v1':'a', 'v2':'b', 'v3':'1'},
{'v1':'2', 'v2':'c', 'v3':'d'}])
or
v1 v2 v3
0 a b 1
1 2 c d
When a cell contains '1', '2' or '3', I would like to replace it with the value from the indicated column in the same row. For example, in the first row, column v3 has the value "1", so I would like to replace it with the first row's value in column v1. Doing this for both rows, I should get:
v1 v2 v3
0 a b a
1 c c d
I can do this with the following code:
for i in range(3):
    for j in range(3):
        df.loc[df['v%d' % (i+1)]==('%d' % (j+1)),'v%d' % (i+1)]= \
            df.loc[df['v%d' % (i+1)]==('%d' % (j+1)),'v%d' % (j+1)]
Is there a less cumbersome way to do this?

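df.apply(lambda row: [row['v'+v] if 'v'+v in row else v for v in row], axis=1, result_type='broadcast')
This iterates over each row and replaces any value v with the value in the column named 'v'+v if that column exists; otherwise it leaves the value unchanged. Since pandas 0.23, a function that returns a list produces a Series of lists by default, so result_type='broadcast' is passed to keep the original columns.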
output:
v1 v2 v3
0 a b a
1 c c d
Note that this does not limit the replacements to digits. For example, if you have a column named 'va', it will replace every cell that contains 'a' with the value in the 'va' column of that row. To limit the columns that replacements can come from, you can define a list of acceptable column names. For example, let's say you only want to make replacements from column v1:
acceptable_columns = ['v1']
df.apply(lambda row: [row['v'+v] if 'v'+v in acceptable_columns else v for v in row], axis=1, result_type='broadcast')
output:
v1 v2 v3
0 a b a
1 2 c d
EDIT
It was pointed out that the answer above throws an error if you have non-string types in your dataframe. You can avoid this by explicitly converting each cell value to a string:
df.apply(lambda row: [row['v'+str(v)] if 'v'+str(v) in row else v for v in row], axis=1, result_type='broadcast')
ORIGINAL (INCORRECT) ANSWER BELOW
note that the answer below only applies when the values to replace are on a diagonal (which is the case in the example but that was not the question asked ... my bad)
You can do this with pandas' replace method and numpy's diag method:
First select the values to replace, these will be the digits 1 to the length of your dataframe:
to_replace = [str(i) for i in range(1,len(df)+1)]
Then select values that each should be replaced with, these will be the diagonal of your data frame:
import numpy as np
replace_with = np.diag(df)
Now you can do the actual replacement:
df.replace(to_replace, replace_with)
which gives:
v1 v2 v3
0 a b a
1 c c d
And of course if you want the whole thing as a one liner:
df.replace([str(i) for i in range(1,len(df)+1)], np.diag(df))
Add the inplace=True keyword arg to replace if you want to do the replacement in place.

I see 2 options.
Loop over the columns and then over the mapping
mapping = {'1': 'v1', '3': 'v3', '2': 'v2'}
df1 = df.copy()
for column_name, column in df1.items():  # iteritems() was removed in pandas 2.0
    for k, v in mapping.items():
        df1.loc[column == k, column_name] = df1.loc[column == k, v]
df1
v1 v2 v3
0 a b a
1 c c d
Loop over the columns, then loop over all the 'hits'
df2 = df.copy()
for column_name, column in df2.items():  # iteritems() was removed in pandas 2.0
    hits = column.isin(mapping.keys())
    for idx, item in column[hits].items():
        df2.loc[idx, column_name] = df2.loc[idx, mapping[item]]
df2
v1 v2 v3
0 a b a
1 c c d
If you've chosen a way, you could reduce the 2 nested for-loops to 1 for-loop with itertools.product
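The itertools.product suggestion could look like the sketch below for the first option. It is one possible reading of that remark rather than the answerer's own code; the name out is hypothetical, and the mask is taken from the untouched df so that a replacement cannot trigger a second one.

```python
import itertools

import pandas as pd

df = pd.DataFrame([{'v1': 'a', 'v2': 'b', 'v3': '1'},
                   {'v1': '2', 'v2': 'c', 'v3': 'd'}])
mapping = {'1': 'v1', '2': 'v2', '3': 'v3'}

out = df.copy()
# One loop over every (column, mapping entry) combination.
for column_name, (k, v) in itertools.product(df.columns, mapping.items()):
    hits = df[column_name] == k
    if hits.any():
        out.loc[hits, column_name] = df.loc[hits, v]
print(out)
```

This performs the same number of iterations as the nested loops; only the nesting is flattened.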

I made this:
df = pd.DataFrame([{'v1':'a', 'v2':'b', 'v3':'1'},
{'v1':'2', 'v2':'c', 'v3':'d'}])
def replace_col(row, columns, col_num_dict={1: 'v1', 2: 'v2', 3: 'v3'}):
    for col in columns:
        x = getattr(row, col)
        try:
            x = int(x)
            if x in col_num_dict:
                setattr(row, col, getattr(row, col_num_dict[x]))
        except ValueError:
            pass
    return row
df = df.apply(replace_col, axis=1, args=(df.columns,))
It applies the function replace_col on every row. The row object's attributes which correspond to its columns get replaced with the right value from the same row. It looks a bit complicated due to the multiple set/get attribute functions, but it does exactly what is needed without too much overhead.

You can modify the data before converting it to a dataframe:
data = [{'v1':'a', 'v2':'b', 'v3':'1'},{'v1':'2', 'v2':'c', 'v3':'d'}]
mapping = {'1': 'v1', '3': 'v3', '2': 'v2'}
for idx, line in enumerate(data):
    for item in line:
        try:
            int(line[item])  # raises ValueError for non-numeric cells
            data[idx][item] = data[idx][mapping[line[item]]]
        except Exception:
            pass
print(data)
[{'v1': 'a', 'v2': 'b', 'v3': 'a'}, {'v1': 'c', 'v2': 'c', 'v3': 'd'}]
