Python pandas question: matching rows, finding indexes and creating a column

I am learning Python and trying to solve a problem but got stuck here. I would like to do the following:
The dataframe is called: df_cleaned_sessions
It contains two columns with timestamps:
datetime_only_first_engagement
datetime_sessions
For your information, the datetime_only_first_engagement column has far fewer timestamps than datetime_sessions; the sessions column has many NA values because this dataframe is the result of a left join.
I would like to do the following:
Find rows where the datetime_only_first_engagement timestamp equals the timestamp from datetime_sessions, save the index of those rows, create a new column in the dataframe called 'is_conversion', and set those (matching-timestamp) indexes to True. The other indexes should be set to False.
Hope someone can help me!
Thanks a lot.

It would have been easier if you had provided sample code and an expected output; however, reading your question, I believe you want to do the following:
import pandas as pd
Let's build a sample df:
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8],[10,11]], columns=["A", "B"])
print(df)
A B
0 1 2
1 3 4
2 5 6
3 7 8
4 10 11
Let's assume df1 to be:
df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8],[9,10]], columns=["D", "E"])
print(df1)
D E
0 1 2
1 3 4
2 5 6
3 7 8
4 9 10
Apply the code below to check whether each value of column A in df exists in column D of df1 (note that isin already returns a boolean Series, so the astype(bool) is redundant but harmless):
df['is_conversion']= df['A'].isin(df1['D']).astype(bool)
print(df)
A B is_conversion
0 1 2 True
1 3 4 True
2 5 6 True
3 7 8 True
4 10 11 False
Similarly for your question, you can apply the same logic to match different columns of the same dataframe. I think you need:
df_cleaned_sessions['is_conversion'] = df_cleaned_sessions['datetime_only_first_engagement'].isin(df_cleaned_sessions['datetime_sessions']).astype(bool)
Based on the comments, add this below the code above to get 1/0 instead of True/False:
df_cleaned_sessions['is_conversion'] = df_cleaned_sessions['is_conversion'].replace({True:1, False:0})
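A side note on the replace step: since isin already returns booleans, casting with astype(int) produces the same 1/0 column in one go. A minimal sketch on made-up data (the column names just mirror the question):

```python
import pandas as pd

# Toy frame standing in for df_cleaned_sessions (made-up timestamps)
df = pd.DataFrame({
    "datetime_only_first_engagement": pd.to_datetime(
        ["2021-01-01 10:00", "2021-01-02 11:00", "2021-01-03 12:00"]),
    "datetime_sessions": pd.to_datetime(
        ["2021-01-01 10:00", "2021-01-05 09:00", "2021-01-03 12:00"]),
})

# isin already yields a boolean Series; astype(int) maps True -> 1, False -> 0
df["is_conversion"] = (
    df["datetime_only_first_engagement"]
    .isin(df["datetime_sessions"])
    .astype(int)
)
print(df["is_conversion"].tolist())  # [1, 0, 1]
```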
Alternative answer using np.where:
import numpy as np
df_cleaned_sessions['is_conversion'] = np.where(df_cleaned_sessions['datetime_only_first_engagement'].isin(df_cleaned_sessions['datetime_sessions']),True,False)
Hope that helps..!

From what I understand, you need numpy.where:
import numpy as np
df_cleaned_sessions['is_conversion'] = np.where(df_cleaned_sessions['datetime_only_first_engagement'] == df_cleaned_sessions['datetime_sessions'], True, False)

df_cleaned_sessions['is_conversion'] = df_cleaned_sessions['datetime_only_first_engagement'] == df_cleaned_sessions['datetime_sessions']
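Note that the isin answer and the two == answers above are not equivalent: isin marks a row True when its engagement timestamp appears anywhere in datetime_sessions, while == only compares the two values on the same row. A small sketch with made-up data showing the difference:

```python
import pandas as pd

# Made-up data: rows 1 and 2 have engagement timestamps that appear
# somewhere in datetime_sessions, but only row 1 matches on the same row
df = pd.DataFrame({
    "datetime_only_first_engagement": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03"]),
    "datetime_sessions": pd.to_datetime(
        ["2021-01-03", "2021-01-02", "2021-01-09"]),
})

anywhere = df["datetime_only_first_engagement"].isin(df["datetime_sessions"])
same_row = df["datetime_only_first_engagement"] == df["datetime_sessions"]

print(anywhere.tolist())  # [False, True, True]
print(same_row.tolist())  # [False, True, False]
```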


Get multiple columns of DataFrame by filter (pandas.filter `like` for multiple strings)

What is the easiest way to retrieve all columns from a DataFrame based on a part of their name, but with different selection patterns? I am too lazy to type the exact column names, and I am looking for a convenient way.
Say I want all columns that start with 'T1' or 'T2' - ideally a command like
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['T101', 'T201', 'G101'])
df.filter(like=['T1', 'T2'])
which is not supported, as like= only accepts a single string.
Slow workaround I currently use:
col_list = df.columns
target_cols = [e for e in col_list if any(se in e for se in ['T1','T2'])]
df[target_cols]
which gives me the desired output:
>>> df[target_cols]
T101 T201
0 1 2
1 4 5
2 7 8
Is there any faster/more convenient way?
Note this is not about the actual data in the rows/I am not looking for a mask to select rows.
you can use regex parameter in filter() method:
df.filter(regex='^T1|^T2')
output:
T101 T201
0 1 2
1 4 5
2 7 8
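For completeness, a self-contained sketch of the regex approach (using a list for columns so their order is deterministic):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=["T101", "T201", "G101"])

# Keep only columns whose names start with T1 or T2
selected = df.filter(regex="^T1|^T2")
print(list(selected.columns))  # ['T101', 'T201']
```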

New column in dataset based on last value of item

I have this dataset
In [4]: df = pd.DataFrame({'A':[1, 2, 3, 4, 5]})
In [5]: df
Out[5]:
A
0 1
1 2
2 3
3 4
4 5
I want to add a new column to the dataset based on the last value of the item, like this:
A New Column
1
2 1
3 2
4 3
5 4
I tried to use apply with iloc, but it didn't work.
Can you help?
Thank you
With your shown samples, please try the following. You can use the shift function to get the new column; it moves all the elements of the given column down one row, leaving a NaN in the first element.
import pandas as pd
df['New_Col'] = df['A'].shift()
OR
In case you would like to fill the NaN with zero, try the following; the approach is the same as above.
import pandas as pd
df['New_Col'] = df['A'].shift().fillna(0)
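A complete runnable sketch of both variants on the sample data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5]})

# shift() moves every value down one row, leaving NaN at the top
df["New_Col"] = df["A"].shift()

# fillna(0) replaces that leading NaN with 0
df["New_Col_filled"] = df["A"].shift().fillna(0)

print(df["New_Col_filled"].tolist())  # [0.0, 1.0, 2.0, 3.0, 4.0]
```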

Pandas: select value from random column on each row

Suppose I have the following Pandas DataFrame:
df = pd.DataFrame({
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
})
a b c
0 1 4 7
1 2 5 8
2 3 6 9
I want to generate a new pandas.Series so that the values of this series are selected, row by row, from a random column in the DataFrame. So, a possible output for that would be the series:
0 7
1 2
2 9
dtype: int64
(where in row 0 it randomly chose 'c', in row 1 it randomly chose 'a' and in row 2 it randomly chose 'c' again).
I know this can be done by iterating over the rows and using random.choice to choose each row, but iterating over the rows not only has bad performance but also is "unpandonic", so to speak. Also, df.sample(axis=1) would choose whole columns, so all of them would be chosen from the same column, which is not what I want. Is there a better way to do this with vectorized pandas methods?
Here is a fully vectorized solution. Note however that it does not use Pandas methods, but rather involves operations on the underlying numpy array.
import numpy as np
indices = np.random.choice(np.arange(len(df.columns)), len(df), replace=True)
Example output is [2, 1, 2], which corresponds to ['c', 'b', 'c'].
Then use this to slice the numpy array:
df['random'] = df.to_numpy()[np.arange(len(df)), indices]
Results:
a b c random
0 1 4 7 7
1 2 5 8 5
2 3 6 9 9
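The two numpy steps above can be combined into a short, reproducible sketch (the seeded default_rng generator is just for repeatability and is not part of the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

rng = np.random.default_rng(0)  # seeded only so reruns are repeatable

# One random column index per row
indices = rng.integers(0, len(df.columns), size=len(df))

# Fancy indexing: row i takes the value from column indices[i]
picked = pd.Series(df.to_numpy()[np.arange(len(df)), indices], index=df.index)
print(picked)
```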
Maybe something like:
pd.Series([np.random.choice(i,1)[0] for i in df.values])
This does the job (using the built-in random module):
import random
ddf = df.apply(lambda row: random.choice(row.tolist()), axis=1)
or using pandas sample (pulling the single sampled value out with iloc[0] so each row yields a scalar):
ddf = df.apply(lambda row: row.sample().iloc[0], axis=1)
Both have the same behaviour. ddf is your Series.
pd.DataFrame(
    df.values[range(df.shape[0]),
              np.random.randint(0, df.shape[1], size=df.shape[0])])
output
0
0 4
1 5
2 9
You're probably still going to need to iterate through each row while selecting a random value in each row - whether you do it explicitly with a for loop or implicitly with whatever function you decide to call.
You can, however, simplify this to a single line using a list comprehension, if it suits your style:
import random
result = pd.Series([random.choice(df.iloc[i]) for i in range(len(df))])

How to convert a Pandas series into a Dataframe for merging [duplicate]

If you came here looking for information on how to
merge a DataFrame and Series on the index, please look at this
answer.
The OP's original intention was to ask how to assign series elements
as columns to another DataFrame. If you are interested in knowing the
answer to this, look at the accepted answer by EdChum.
Best I can come up with is
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]}) # see EDIT below
s = pd.Series({'s1':5, 's2':6})
for name in s.index:
df[name] = s[name]
a b s1 s2
0 1 3 5 6
1 2 4 5 6
Can anybody suggest better syntax / faster method?
My attempts:
df.merge(s)
AttributeError: 'Series' object has no attribute 'columns'
and
df.join(s)
ValueError: Other Series must have a name
EDIT The first two answers posted highlighted a problem with my question, so please use the following to construct df:
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
with the final result
a b s1 s2
3 NaN 4 5 6
5 2 5 5 6
6 3 6 5 6
Update
From v0.24.0 onwards, you can merge a DataFrame and a Series, as long as the Series is named.
df.merge(s.rename('new'), left_index=True, right_index=True)
# If series is already named,
# df.merge(s, left_index=True, right_index=True)
Nowadays, you can simply convert the Series to a DataFrame with to_frame(). So (if joining on index):
df.merge(s.to_frame(), left_index=True, right_index=True)
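A runnable sketch of the to_frame() route, for the index-joining reading of the question (the series here shares the DataFrame's index, and the name 'new' is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
s = pd.Series([5, 6], name="new", index=df.index)  # shares df's index

# to_frame() turns the named Series into a one-column DataFrame,
# which can then be merged on the index
merged = df.merge(s.to_frame(), left_index=True, right_index=True)
print(merged)
```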
You could construct a dataframe from the series and then merge with the dataframe.
So you specify the data as the series' values repeated len(s) times, set the columns to the series' index, and set the left_index and right_index params to True:
In [27]:
df.merge(pd.DataFrame(data = [s.values] * len(s), columns = s.index), left_index=True, right_index=True)
Out[27]:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
EDIT: for the situation where you want the df constructed from the series to use the index of df, you can do the following:
df.merge(pd.DataFrame(data = [s.values] * len(df), columns = s.index, index=df.index), left_index=True, right_index=True)
This assumes that the indices match the length.
Here's one way:
df.join(pd.DataFrame(s).T).fillna(method='ffill')
To break down what happens here...
pd.DataFrame(s).T creates a one-row DataFrame from s which looks like this:
s1 s2
0 5 6
Next, join concatenates this new frame with df:
a b s1 s2
0 1 3 5 6
1 2 4 NaN NaN
Lastly, the NaN values at index 1 are filled with the previous values in the column using fillna with the forward-fill (ffill) argument:
a b s1 s2
0 1 3 5 6
1 2 4 5 6
To avoid using fillna, it's possible to use pd.concat to repeat the rows of the DataFrame constructed from s. In this case, the general solution is:
df.join(pd.concat([pd.DataFrame(s).T] * len(df), ignore_index=True))
Here's another solution to address the indexing challenge posed in the edited question:
df.join(pd.DataFrame(s.repeat(len(df)).values.reshape((len(df), -1), order='F'),
columns=s.index,
index=df.index))
s is transformed into a DataFrame by repeating the values and reshaping (specifying 'Fortran' order), and also passing in the appropriate column names and index. This new DataFrame is then joined to df.
Nowadays, a much simpler and more concise solution can achieve the same task. Leveraging the capability of DataFrame.apply() to turn a returned Series into columns of the resulting DataFrame, we can use:
df.join(df.apply(lambda x: s, axis=1))
Result:
a b s1 s2
3 NaN 4 5 6
5 2.0 5 5 6
6 3.0 6 5 6
Here, we use DataFrame.apply() with a simple lambda function as the applied function on axis=1. The applied lambda function simply returns the Series s:
df.apply(lambda x: s, axis=1)
Result:
s1 s2
3 5 6
5 5 6
6 5 6
The result has already inherited the row index of the original DataFrame df. Consequently, we can simply join df with this interim result by DataFrame.join() to get the desired final result (since they have the same row index).
This capability of DataFrame.apply() to turn a Series into columns of its belonging DataFrame is well documented in the official document as follows:
By default (result_type=None), the final return type is inferred from
the return type of the applied function.
The default behaviour (result_type=None) depends on the return value of the
applied function: list-like results will be returned as a Series of
those. However if the apply function returns a Series these are
expanded to columns.
The official documentation also includes an example of such usage:
Returning a Series inside the function is similar to passing
result_type='expand'. The resulting column names will be the Series
index.
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
foo bar
0 1 2
1 1 2
2 1 2
If I may suggest setting up your dataframes like this (auto-indexing):
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
then you can set up your s1 and s2 values thus (using df.shape[0] to get the number of rows in df):
s = pd.DataFrame({'s1':[5]*df.shape[0], 's2':[6]*df.shape[0]})
then the result you want is easy:
display (df.merge(s, left_index=True, right_index=True))
Alternatively, just add the new values to your dataframe df:
df = pd.DataFrame({'a':[np.nan, 1, 2], 'b':[4, 5, 6]})
df['s1']=5
df['s2']=6
display(df)
Both return:
a b s1 s2
0 NaN 4 5 6
1 1.0 5 5 6
2 2.0 6 5 6
If you have another list of data (instead of just a single value to apply), and you know it is in the same sequence as df, eg:
s1=['a','b','c']
then you can attach this in the same way:
df['s1']=s1
returns:
a b s1
0 NaN 4 a
1 1.0 5 b
2 2.0 6 c
You can easily set a pandas.DataFrame column to a constant. This constant can be an int, as in your example. If the column you specify isn't in df, pandas will create a new column with the name you specify. So after your dataframe is constructed (from your question):
df = pd.DataFrame({'a':[np.nan, 2, 3], 'b':[4, 5, 6]}, index=[3, 5, 6])
You can just run:
df['s1'], df['s2'] = 5, 6
You could write a loop or comprehension to do this for all the elements of a list of tuples, or for the keys and values of a dictionary, depending on how your real data is stored.
If df is a pandas.DataFrame, then df['new_col'] = list_object, where list_object is a list or Series of length len(df), will add list_object as a column named 'new_col'. df['new_col'] = scalar (such as 5 or 6 in your case) also works and is equivalent to df['new_col'] = [scalar]*len(df).
So a short loop serves the purpose:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
s = pd.Series({'s1':5, 's2':6})
for x in s.index:
df[x] = s[x]
Output:
a b s1 s2
0 1 3 5 6
1 2 4 5 6

How to edit/add two columns to a dataframe in pandas at once - df.apply()

So I've been doing things like this with pandas:
usrdata['columnA'] = usrdata.apply(functionA, axis=1)
in order to do row operations and changing/adding columns to my dataframe.
However, now I want to try to do something like this:
usrdata['columnB', 'columnC'] = usrdata.apply(functionB, axis=1)
But the output of functionB is apparently a Series of tuples (with two values for each row). Is there a nice way for me to either:
format the output from functionB so it can readily be added to my dataframe, or
add (and possibly have to unpack) the output from functionB and assign each value to its own column of my dataframe?
Try using zip:
usrdata['columnB'], usrdata['columnC'] = zip(*usrdata.apply(functionB, axis=1))
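A minimal runnable version of the zip trick; functionB here is a made-up stand-in that returns two values per row:

```python
import pandas as pd

usrdata = pd.DataFrame({"columnA": [1, 2, 3]})

# Hypothetical row function returning two values as a tuple
def functionB(row):
    return row["columnA"] * 2, row["columnA"] * 10

# apply() yields a Series of tuples; zip(*...) transposes it into
# two sequences, one per new column
usrdata["columnB"], usrdata["columnC"] = zip(*usrdata.apply(functionB, axis=1))
print(usrdata)
```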
I'd assign directly to the new columns and modify the function body to return a Series constructed from a list of the data:
In [9]:
df = pd.DataFrame({'a':[1, 2, 3, 4, 5]})
df
Out[9]:
a
0 1
1 2
2 3
3 4
4 5
In [10]:
def func(x):
return pd.Series([x*3, x*10])
df[['b','c']] = df['a'].apply(func)
df
Out[10]:
a b c
0 1 3 10
1 2 6 20
2 3 9 30
3 4 12 40
4 5 15 50
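An alternative sketch using result_type='expand' on DataFrame.apply, which spreads a returned list across columns (the x*3 / x*10 logic mirrors the answer above):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5]})

# result_type='expand' spreads a returned list across columns 0 and 1
expanded = df.apply(lambda row: [row["a"] * 3, row["a"] * 10],
                    axis=1, result_type="expand")
df["b"] = expanded[0]
df["c"] = expanded[1]
print(df)
```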
