Create new column with reverse order using pandas dataframe - python

I need to create a new column X containing the reversed-order value of column x, as shown below.
x X
aa01 01aa
bb02 02bb
cc03 03cc
I sliced and concatenated them manually and it worked, but I am looking for a "smarter" way of doing this.
df["X1"] = df["x"].str.slice(0,2)
df["X2"] = df["x"].str.slice(2,4)
df['X'] = df["X2"]+ df["X1"].map(str)

A faster way would be to use a list comprehension instead of the pandas str functions:
df['X'] = [s[-2:]+s[:2] for s in df.x]
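For reference, the same result can be obtained without any explicit loop by using vectorized string indexing through the .str accessor; this is a minimal sketch assuming every value in x is a four-character string made of two two-character halves:
import pandas as pd
df = pd.DataFrame({'x': ['aa01', 'bb02', 'cc03']})
# Swap the two halves of each string in a single vectorized expression
df['X'] = df['x'].str[-2:] + df['x'].str[:2]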

Related

Is there a way to add two arrays in two columns into a third array using pandas

I am working on a project which uses a pandas data frame, where I received some values into its columns as shown below.
I need to add the pos_vec column and the word_vec column element-wise and create a new column called sum_of_arrays. The arrays in the third column should also have size 2.
Eg: pos_vec Word_vec sum_of_arrays
[-0.22683072, 0.32770252] [0.3655883, 0.2535131] [0.13875758,0.58121562]
Is there anyone who can help me? I'm stuck here. :(
If you convert them to np.array you can simply sum them.
import pandas as pd
import numpy as np
df = pd.DataFrame({'pos_vec':[[-0.22683072,0.32770252],[0.14382899,0.049593687],[-0.24300802,-0.0908088],[-0.2507714,-0.18816864],[0.32294357,0.4486494]],
'word_vec':[[0.3655883,0.2535131],[0.33788466,0.038143277], [-0.047320127,0.28842866],[0.14382899,0.049593687],[-0.24300802,-0.0908088]]})
If you want to use numpy
df['col_sum'] = df[['pos_vec','word_vec']].applymap(lambda x: np.array(x)).sum(1)
If you don't want to use numpy
df['col_sum'] = df.apply(lambda row: [a + b for a, b in zip(row.pos_vec, row.word_vec)], axis=1)
There may be cleaner approaches that iterate over the columns with pandas itself, but this is the solution I came up with by extracting the data from the DataFrame as lists:
# Extract data as lists
pos_vec = df["pos_vec"].tolist()
word_vec = df["word_vec"].tolist()
# Create new list with desired calculation
sum_of_arrays = [[x + y for x, y in zip(l1, l2)] for l1, l2 in zip(pos_vec, word_vec)]
# Add new list to DataFrame
df["sum_of_arrays"] = sum_of_arrays

how to generate column in pandas data frame using other columns and string formatting

I am trying to generate a third column in a pandas dataframe using two other columns of the dataframe. The requirement is very particular to the scenario for which I need to generate the third column's data.
The requirement is stated as:
Let the dataframe be named df, the first column 'first_name', and the second column 'last_name'.
I need to generate the third column in such a way that it uses string formatting to build a particular string, passes it to a function, and uses whatever the function returns as the value of the third column.
Problem 1
base_string = "my name is {first} {last}"
df['summary'] = base_string.format(first=df['first_name'], last=df['last_name'])
Problem 2
df['summary'] = some_func(base_string.format(first=df['first_name'], last=df['last_name']))
My ultimate goal is to solve Problem 2, but Problem 1 is a prerequisite for it, and as of now I'm unable to solve that. I have tried converting my dataframe values to strings, but it is not working the way I expected.
You can use apply:
df['summary'] = df.apply(lambda r: base_string.format(first=r['first_name'], last=r['last_name']),
                         axis=1)
Or list comprehension:
df['summary'] = [base_string.format(first=x, last=y)
                 for x, y in zip(df['first_name'], df['last_name'])]
And then, for general function some_func:
df['summary'] = [some_func(base_string.format(first=x, last=y))
                 for x, y in zip(df['first_name'], df['last_name'])]
You could use pandas.DataFrame.apply with axis=1 so your code will look like this:
def mapping_function(row):
    # build the formatted string from the row's columns and do your calculation
    value = base_string.format(first=row['first_name'], last=row['last_name'])
    return value
df['summary'] = df.apply(mapping_function, axis=1)
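Putting the two problems together, here is a minimal end-to-end sketch; some_func below is only a placeholder (it upper-cases the string) standing in for whatever the real function does:
import pandas as pd
base_string = "my name is {first} {last}"
df = pd.DataFrame({'first_name': ['John', 'Jane'], 'last_name': ['Doe', 'Roe']})
def some_func(s):
    # placeholder for the real function from the question
    return s.upper()
df['summary'] = [some_func(base_string.format(first=x, last=y))
                 for x, y in zip(df['first_name'], df['last_name'])]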

Apply Pandas series string function to the whole dataframe

I want to apply the method pd.Series.str.join() to my whole dataframe
A B
[foo,bar] [1,2]
[bar,foo] [3,4]
Desired output:
A B
foobar 12
barfoo 34
For now I used a quite slow method:
a = [df[x].str.join('') for x in df.columns]
I tried
df.apply(pd.Series.str.join)
and
df.agg(pd.Series.str.join)
and
df.applymap(str.join)
but none of them seem to work. As an extension of the question, how can I efficiently apply a Series method to the whole dataframe?
Thank you.
There will always be a problem when trying to join lists that contain numeric values, which is why I suggest we first turn them into strings. Afterwards, we can solve it with a nested list comprehension:
df = pd.DataFrame({'A':[['Foo','Bar'],['Bar','Foo']],'B':[[1,2],[3,4]]})
df['B'] = df['B'].map(lambda x: [str(i) for i in x])
df_new = pd.DataFrame([[''.join(x) for x in df[i]] for i in df],index=df.columns).T
Which correctly outputs:
A B
FooBar 12
BarFoo 34
import pandas as pd
df=pd.DataFrame({'A':[['foo','bar'],['bar','foo']],'B':[[1,2],[3,4]]})
#If 'B' is list of integers, else the below step can be ignored
df['B']=df['B'].transform(lambda value: [str(x) for x in value])
df=df.applymap(lambda value:''.join(value))
Explanation: applymap() applies a function to every value of your dataframe.
I came up with this solution:
df_sum = df_sum.stack().str.join('').unstack()
I have quite a big dataframe, so a for loop does not really scale.
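A minimal runnable sketch of that stack/unstack approach, assuming the numeric lists are first converted to strings as in the answers above:
import pandas as pd
df = pd.DataFrame({'A': [['foo', 'bar'], ['bar', 'foo']],
                   'B': [[1, 2], [3, 4]]})
# Cast the list elements to strings so .str.join works on every column
df = df.applymap(lambda value: [str(x) for x in value])
# Stack into a single Series of lists, join each list, then restore the shape
df_joined = df.stack().str.join('').unstack()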

Pandas loc is returning series not df

The following code returns a series for y when I want a df. Ultimately I am pulling rows out of a larger raw df (df) to create a smaller df (Cand) of results. I have created Cand as the new empty df to be populated.
Cand = pd.DataFrame(columns=['SR','Hits','Breaks'])
x = df.loc[df['Breaks'] == 0]
y = x.loc[x['Hits'].idxmax()]
Cand.append(y)
x is correctly reflected as a df, but y becomes a series and so does not populate Cand.
I have looked around but cannot find a similar problem. Thanks in advance.
The issue is not that you aren't passing a DataFrame to append(), but that .append() is not in-place; reassign its return value, Cand = Cand.append(y), since append returns your initial DataFrame plus other (Cand + y, in this case).
Side Note:
You can return a DataFrame from .loc by using double square brackets.
Example: y = x.loc[[x['Hits'].idxmax()]]
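A minimal sketch combining both fixes, with hypothetical data standing in for the original raw df; note that DataFrame.append was removed in pandas 2.0, so pd.concat is used here in place of reassigning append's return value:
import pandas as pd
df = pd.DataFrame({'SR': [1, 2, 3],
                   'Hits': [10, 25, 15],
                   'Breaks': [0, 0, 1]})
Cand = pd.DataFrame(columns=['SR', 'Hits', 'Breaks'])
x = df.loc[df['Breaks'] == 0]
y = x.loc[[x['Hits'].idxmax()]]   # double brackets keep y as a DataFrame
Cand = pd.concat([Cand, y])       # concat is not in-place either, so reassign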

compare list of dictionaries to dataframe, show missing values

I have a list of dictionaries
example_list = [{'email':'myemail#email.com'},{'email':'another#email.com'}]
and a dataframe with an 'Email' column
I need to compare the list against the dataframe and return the values that are not in the dataframe.
I can certainly iterate over the list and check each value against the dataframe, but I was looking for a more pythonic way, perhaps using a list comprehension or a map function on the dataframe?
To return the values that are not in df['Email'], here are a couple of options involving set difference operations:
np.setdiff1d
emails = [d['email'] for d in example_list]
diff = np.setdiff1d(emails, df['Email']) # returns a list
set.difference
# returns a set
diff = set(d['email'] for d in example_list).difference(df['Email'])
One way is to take one set from another. For a functional solution you can use operator.itemgetter:
from operator import itemgetter
res = set(map(itemgetter('email'), example_list)) - set(df['Email'])
Note that - is syntactic sugar for set.difference.
I ended up converting the list into a dataframe, comparing the two dataframes by merging them on a column, and then creating a dataframe out of the missing values
so, for example
example_list = [{'email':'myemail#email.com'},{'email':'another#email.com'}]
df_two = pd.DataFrame(item for item in example_list)
# the list's column is 'email' while the dataframe's column is 'Email'
common = df_one.merge(df_two, left_on='Email', right_on='email')
df_diff = df_one[(~df_one.Email.isin(common.Email))]
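For the original goal of listing the values that are missing from the dataframe, here is a hedged sketch using merge with indicator=True; df_one below is hypothetical stand-in data:
import pandas as pd
example_list = [{'email': 'myemail#email.com'}, {'email': 'another#email.com'}]
df_one = pd.DataFrame({'Email': ['myemail#email.com']})   # hypothetical existing data
df_two = pd.DataFrame(example_list).rename(columns={'email': 'Email'})
merged = df_two.merge(df_one, on='Email', how='left', indicator=True)
missing = merged.loc[merged['_merge'] == 'left_only', 'Email'].tolist()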
