Pandas new column based on a condition on two other dataframes - python

df1 and df2 are of different sizes. I want to set the df1(row1, 'Z') value to the df2(row2, 'C') value whenever df1(row1, 'A') is equal to df2(row2, 'B').
What is the recommended way to implement df1['Z'] = df2['C'] if df1['A'] == df2['B']?
df1 = pd.DataFrame({'A': ['foo', 'bar', 'test'], 'b': [1, 2, 3], 'c': [3, 4, 5]})
df2 = pd.DataFrame({'B': ['foo', 'baz'], 'C': [3, 1]})
df1
      A  b  c
0   foo  1  3
1   bar  2  4
2  test  3  5
df2
     B  C
0  foo  3
1  baz  1
After the change:
df1
      A  b  c    Z
0   foo  1  3    3
1   bar  2  4  NaN
2  test  3  5  NaN
What if multiple assignments need to follow multiple conditions? Is iterating over the rows recommended, as shown below?
for i, row in df1.iterrows():
    if <condition(s)>:
        # do assignment(s), e.g.:
        df1.at[i, 'hjk'] = something

You can use numpy.where, passing df1.A equals df2.B as the condition; where it is True, take df2.C, otherwise keep df1.Z:
np.where(df1.A.eq(df2.B), df2.C, df1.Z)
Assign the result above to df1.Z. Note that this compares the two frames position by position, so it assumes df1 and df2 have the same length and that df1 already has a Z column (as in the sample below).
SAMPLE:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': np.random.randint(5, 10, 20), 'Z': np.random.randint(5, 10, 20)})
df2 = pd.DataFrame({'C': np.random.randint(5, 10, 20), 'B': np.random.randint(5, 10, 20)})

>>> df1.Z.values
array([7, 6, 7, 7, 6, 8, 9, 7, 6, 6, 7, 6, 8, 7, 8, 8, 9, 6, 7, 7])
>>> np.where(df1.A.eq(df2.B), df2.C, df1.Z)
array([7, 6, 6, 7, 6, 8, 9, 7, 6, 9, 7, 8, 8, 7, 8, 8, 9, 6, 7, 7])
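To finish the job, write the result back (a minimal sketch; as noted above, this assumes the two frames are the same length and positionally aligned):
df1['Z'] = np.where(df1.A.eq(df2.B), df2.C, df1.Z)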

I would like to try map, using a lookup dictionary built from df2:
df1['Z'] = df1['A'].map(dict(zip(df2['B'], df2['C'])))
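On the question's sample frames this reproduces the expected output, since map looks each df1['A'] value up in the {B: C} dictionary and fills unmatched keys with NaN (which makes the column float):
df1 = pd.DataFrame({'A': ['foo', 'bar', 'test'], 'b': [1, 2, 3], 'c': [3, 4, 5]})
df2 = pd.DataFrame({'B': ['foo', 'baz'], 'C': [3, 1]})
df1['Z'] = df1['A'].map(dict(zip(df2['B'], df2['C'])))
print(df1)
#       A  b  c    Z
# 0   foo  1  3  3.0
# 1   bar  2  4  NaN
# 2  test  3  5  NaN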

Related

Dataframe age column grouping in pandas [duplicate]

It seems like a simple question, but I need your help.
For example, I have df:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 3, 1, 8, 9, 6, 7, 4, 6]
How can I group 'x' into the ranges 1 to 5 and 6 to 10, and calculate the mean 'y' value for these two bins?
I expect to get new df like:
x_grpd = [5, 10]
y_grpd = [3, 6.4]
The range of 'x' is given only as an example. Ideally I want to be able to set any integer value to get a different number of bins.
You can use cut and groupby.mean:
import pandas as pd

# build the DataFrame from the question's lists
df = pd.DataFrame({'x': x, 'y': y})

bins = [5, 10]
df2 = (df
       .groupby(pd.cut(df['x'], [0] + bins,
                       labels=bins,
                       right=True))
       ['y'].mean()
       .reset_index()
       )
Output:
    x    y
0   5  3.0
1  10  6.4
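To get a different number of bins without hard-coding the edges, they can be generated from a bin width; a minimal sketch, where width is a hypothetical parameter:
import numpy as np

width = 5  # hypothetical bin width
bins = list(np.arange(width, df['x'].max() + width, width))  # [5, 10] for this data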

Appending columns to other columns in Pandas

Given the dataframe:
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
What is the easiest way to append the third column to the first and the fourth column to the second?
The result should look like:
d = {'col1': [1, 2, 3, 4, 7, 7, 8, 12, 1, 11], 'col2': [4, 5, 6, 9, 5, 12, 13, 14, 15, 16]}
I need to use this for a script with different column names, so referencing columns by name is not possible. I have tried something along the lines of df.iloc[:, x] to achieve this.
You can use groupby on pairs of column positions (note that groupby(..., axis=1) is deprecated in recent pandas versions):
out = pd.concat([subdf.set_axis(['col1', 'col2'], axis=1)
                 for _, subdf in df.groupby(pd.RangeIndex(df.shape[1]) // 2, axis=1)])
print(out)
# Output
   col1  col2
0     1     4
1     2     5
2     3     6
3     4     9
4     7     5
0     7    12
1     8    13
2    12    14
3     1    15
4    11    16
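On pandas versions where the axis=1 keyword has been removed from groupby, the same grouping can be done on the transpose; a minimal sketch with the same hard-coded names as above:
out = pd.concat([subdf.T.set_axis(['col1', 'col2'], axis=1)
                 for _, subdf in df.T.groupby(pd.RangeIndex(df.shape[1]) // 2)])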
You can change the column names and concat:
pd.concat([df[['col1', 'col2']],
           df[['col3', 'col4']].set_axis(['col1', 'col2'], axis=1)])
Add ignore_index=True to reset the index in the process.
Output:
   col1  col2
0     1     4
1     2     5
2     3     6
3     4     9
4     7     5
0     7    12
1     8    13
2    12    14
3     1    15
4    11    16
Or, using numpy:
N = 2
pd.DataFrame(
    df.values
      .reshape((-1, df.shape[1] // 2, N))  # (rows, column pairs, N)
      .reshape(-1, N, order='F'),          # Fortran order stacks the pairs one after another
    columns=df.columns[:N]
)
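A quick way to see why the double reshape works is to inspect the intermediate shape; a minimal sketch using the question's data:
import numpy as np
import pandas as pd

df = pd.DataFrame(d)  # d as defined in the question
pairs = df.values.reshape((-1, df.shape[1] // 2, 2))
print(pairs.shape)    # (5, 2, 2): row, column pair, value within the pair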
This may not be the most efficient solution, but you can do it using the pd.concat() function in pandas.
First convert your initial dict d into a pandas DataFrame, then apply the concat function:
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
df = pd.DataFrame(d)
d_2 = {'col1': pd.concat([df.iloc[:, 0], df.iloc[:, 2]]),
       'col2': pd.concat([df.iloc[:, 1], df.iloc[:, 3]])}
d_2 is your required dict. Convert it to a DataFrame if you need to:
df_2 = pd.DataFrame(d_2)

extract elements of tuple from a pandas series

I have a pandas Series whose elements are tuples of lists. Each tuple has exactly two elements, and there are a bunch of NaNs. I am trying to split each list in the tuple into its own column.
import pandas as pd
import numpy as np

df = pd.DataFrame({'val': [([1, 2, 3], [4, 5, 6]),
                           ([7, 8, 9], [10, 11, 12]),
                           np.nan]})
Expected Output:
           x             y
0  [1, 2, 3]     [4, 5, 6]
1  [7, 8, 9]  [10, 11, 12]
2        NaN           NaN
If you know the length of the tuples is exactly 2, you can do:
df["x"] = df.val.str[0]
df["y"] = df.val.str[1]
print(df[["x", "y"]])
Prints:
           x             y
0  [1, 2, 3]     [4, 5, 6]
1  [7, 8, 9]  [10, 11, 12]
2        NaN           NaN
You could also convert the column to a list and pass it to the DataFrame constructor (filling None with np.nan as well):
out = pd.DataFrame(df['val'].tolist(), columns=['x','y']).fillna(np.nan)
Output:
           x             y
0  [1, 2, 3]     [4, 5, 6]
1  [7, 8, 9]  [10, 11, 12]
2        NaN           NaN
One way using pandas.Series.apply:
new_df = df["val"].apply(pd.Series)
print(new_df)
Output:
           0             1
0  [1, 2, 3]     [4, 5, 6]
1  [7, 8, 9]  [10, 11, 12]
2        NaN           NaN

picking values from columns [duplicate]

This question already has answers here: Vectorized lookup on a pandas dataframe (3 answers). Closed 3 years ago.
I have a pandas DataFrame with values in a number of columns, make it two for simplicity, and a column of column names I want to use to pick values from the other columns:
import pandas as pd
import numpy as np

np.random.seed(1337)
df = pd.DataFrame(
    {"a": np.arange(10), "b": 10 - np.arange(10), "c": np.random.choice(["a", "b"], 10)}
)
which gives
> df['c']
0 b
1 b
2 a
3 a
4 b
5 b
6 b
7 a
8 a
9 a
Name: c, dtype: object
That is, I want the first and second elements to be picked from column b, the third from a and so on.
This works:
def pick_vals_from_cols(df, col_selector):
    condlist = np.row_stack(col_selector.map(lambda x: x == df.columns))
    values = np.select(condlist.transpose(), df.values.transpose())
    return values
> pick_vals_from_cols(df, df["c"])
array([10, 9, 2, 3, 6, 5, 4, 7, 8, 9], dtype=object)
But it just feels so fragile and clunky. Is there a better way to do this?
lookup
df.lookup(df.index, df.c)
array([10, 9, 2, 3, 6, 5, 4, 7, 8, 9])
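Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On recent versions, the replacement suggested in the pandas docs combines factorize with NumPy integer indexing; a minimal sketch:
import numpy as np

# map each row's column label to a column position, then pick row-wise
idx, cols = pd.factorize(df['c'])
values = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]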
Comprehension
But why when you have lookup?
[df.at[t] for t in df.c.items()]
[10, 9, 2, 3, 6, 5, 4, 7, 8, 9]
Bonus Hack
Not intended for actual use
[*map(df.at.__getitem__, zip(df.index, df.c))]
[10, 9, 2, 3, 6, 5, 4, 7, 8, 9]
Because df.get_value is deprecated
[*map(df.get_value, df.index, df.c)]
FutureWarning: get_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead
[10, 9, 2, 3, 6, 5, 4, 7, 8, 9]

Pandas data frame to dictionary of lists

How do I use Python or (preferably) pandas to convert a pandas DataFrame to a dictionary of lists for input into Highcharts?
The closest I got was:
df.T.to_json('bar.json', orient='index')
But this is a dict of dicts instead of dict of lists.
My input:
import pandas
import numpy as np

df = pandas.DataFrame({
    "date": ['2014-10-1', '2014-10-2', '2014-10-3', '2014-10-4', '2014-10-5'],
    "time": [1, 2, 3, 4, 5],
    # np.random.random_integers is deprecated; randint's upper bound is exclusive
    "temp": np.random.randint(0, 11, 5),
    "foo": np.random.randint(0, 11, 5)
})
df2 = df.set_index(['date'])
df2
Output:
           time  temp  foo
date
2014-10-1     1     3    0
2014-10-2     2     8    7
2014-10-3     3     4    9
2014-10-4     4     4    8
2014-10-5     5     6    2
Desired Output: I am using this output in Highcharts, which requires it to be a dictionary of lists like so:
{'date': ['2014-10-1', '2014-10-2', '2014-10-3', '2014-10-4', '2014-10-5'],
'foo': [7, 2, 5, 5, 6],
'temp': [8, 6, 10, 10, 3],
'time': [1, 2, 3, 4, 5]}
In [199]: df2.reset_index().to_dict(orient='list')
Out[199]:
{'date': ['2014-10-1', '2014-10-2', '2014-10-3', '2014-10-4', '2014-10-5'],
'foo': [8, 1, 8, 8, 1],
'temp': [10, 10, 8, 3, 10],
'time': [1, 2, 3, 4, 5]}
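Since Highcharts ultimately consumes JSON, the resulting dict can be serialized directly (a minimal sketch, assuming all values are JSON-serializable, as they are here):
import json

payload = json.dumps(df2.reset_index().to_dict(orient='list'))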
To create a list of dictionaries, one per row:
post_data_list = []
for i in df2.index:
    data_dict = {}
    for column in df2.columns:
        data_dict[column] = df2[column][i]
    post_data_list.append(data_dict)
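The same list of per-row dictionaries (without the index) is available directly from pandas as a one-liner:
post_data_list = df2.to_dict(orient='records')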
