Function multiple arguments dataframe Python - python

I have a huge dataframe with a lot of dates. I want to apply a function with multiple arguments to a set of those columns in order to create a new one in this dataframe.
The function I have is the following (it works correctly):
def func(*args):
count=0
for i in args:
if i=="Cool":
count+=1
return count
I create a new column in my dataframe applying this function to a set of columns:
dates=["2000","2001","2002","2003","2004","2005","2006","2007","2009",]
df["new_Column"]=df.apply(lambda row : func(row[date] for date in dates), axis = 1)
However, after execution my new_Column is constantly equal to zero. The problem comes from the last line for sure. Any ideas?

It's because you're passing generator object as only argument to func. Since generator object is not 'Cool' you get 0.
As others noticed your question is not complete. But as far as I can anticipate you have dataframe which looks like this
import pandas as pd
df = pd.DataFrame({'2000': ['Cool', 'yay', 'nope'], '2001': ['ugly', 'cool', 'nice']})
So you can rewrite your func
def func(lst):
count=0
for i in lst:
if i=="Cool":
count+=1
return count
And create new column with list constructor
df["new_Column"]=df.apply(lambda row : func(list(row[date] for date in ['2000', '2001'])), axis = 1)
and receive
2000 2001 new_Column
0 Cool ugly 1
1 yay cool 0
2 nope nice 0
If this is the case there is pure pandas solution
df['new_Column2']=df[df.isin(['Cool'])].count(axis=1)

Related

Passing an argument with a range of values into a single-argument python function

I wrote a simple python script to concatenate values from the first row of one dataframe with all rows of another dataframe.
Snippet of one of the dataframes (the second one has identical number of columns):
ID Sequence
1 ATGCCCC
2 GCTCCAC
...
My code:
...
def mixer(x):
for row in df1.iterrows():
fdf["New_ID"]=df1.loc[x, "First_ID"]+df2["Second_ID"]
fdf["Sequence"]=df1.loc[x, "Sequence"]+df2["Sequence"]
print(fdf)
mixer(0)
mixer(1)
mixer(2)
...
Currently my first dataframe has only 8 rows but in the future I may have up to a 1000.
How can I avoid repeatedly calling the function for each value of the argument x (as you can see at the end of the code snippet)?
I tried using "range" and putting row numbers into a list/tuple and passing it through the function but neither worked.
Would be grateful for your help!
Why don't you try:
for x in range(num):
mixer(x)
or you could try:
num = 1
while num < other_num:
mixer(num)
But this seems so obvius judging by the complexity of your question, so in case this isn't what you were expecting, could you be more specific?

Adding new dask column based on a vectorized function

I have what I thought would be a straightforward thing to do in python using dask. I have a dataframe with some records in it, and I want to add a new column based on calling a function with values from two other columns as parameters.
Here is what I mean (pretend ge exists and takes two parameters):
def gc(x, y):
return ge(x, y)
def gdf(df):
func1 = np.vectorize(gc)
gh = da.from_array(func1(df.x, df.y))
df['gh'] = gh
However, I seem to get one issue or another no matter what I try to do. Currently, in the above state, I get
Number of partitions do not match (2 != 33)
It feels like I'm either going about this all wrong (like maybe I need map_blocks or map_partitions or even gufunc), or I'm missing something easy where I can set the number of partitions on my array to match that of my dataframe.
Any help would be appreciated.
It should be possible to do this with assign or map_partitions:
func1 = np.vectorize(gc)
df = df.assign(gh=lambda df: func1(df.x, df.y))
# or try this
def myfunc(df):
df['gh'] = func1(df.x, df.y)
return df
df = df.map_partitions(myfunc)

python for loop using index to create values in dataframe

I have a very simple for loop problem and I haven't found a solution in any of the similar questions on Stack. I want to use a for loop to create values in a pandas dataframe. I want the values to be strings that contain a numerical index. I can make the correct value print, but I can't make this value get saved in the dataframe. I'm new to python.
# reproducible example
import pandas as pd
df1 = pd.DataFrame({'x':range(5)})
# for loop to add a row with an index
for i in range(5):
print("data_{i}.txt".format(i=i)) # this prints the value that I want
df1['file'] = "data_{i}.txt".format(i=i)
This loop prints the exact value that I want to put into the 'file' column of df1, but when I look at df1, it only uses the last value for the index.
x file
0 0 data_4.txt
1 1 data_4.txt
2 2 data_4.txt
3 3 data_4.txt
4 4 data_4.txt
I have tried using enumerate, but can't find a solution with this. I assume everyone will yell at me for posting a duplicate question, but I have not found anything that works and if someone points me to a solution that solves this problem, I'll happily remove this question.
There are better ways to create a DataFrame, but to answer your question:
Replace the last line in your code:
df1['file'] = "data_{i}.txt".format(i=i)
with:
df1.loc[i, 'file'] = "data_{0}.txt".format(i)
For more information, read about the .loc here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
On the same page, you can read about accessors like .at and .iloc as well.
You can do list-comprehension:
df1['file'] = ["data_{i}.txt".format(i=i) for i in range(5)]
print(df1)
Prints:
x file
0 0 data_0.txt
1 1 data_1.txt
2 2 data_2.txt
3 3 data_3.txt
4 4 data_4.txt
OR at the creating of DataFrame:
df1 = pd.DataFrame({'x':range(5), 'file': ["data_{i}.txt".format(i=i) for i in range(5)]})
print(df1)
OR:
df1 = pd.DataFrame([{'x':i, 'file': "data_{i}.txt".format(i=i)} for i in range(5)])
print(df1)
I've found success with the .at method
for i in range(5):
print("data_{i}.txt".format(i=i)) # this prints the value that I want
df1.at[i, 'file'] = "data_{i}.txt".format(i=i)
Returns:
x file
0 0 data_0.txt
1 1 data_1.txt
2 2 data_2.txt
3 3 data_3.txt
4 4 data_4.txt
when you assign a variable to a dataframe column the way you do -
using the df['colname'] = 'val', it assigns the val across all rows.
That is why you are seeing only the last value.
Change your code to:
import pandas as pd
df1 = pd.DataFrame({'x':range(5)})
# for loop to add a row with an index
to_assign = []
for i in range(5):
print("data_{i}.txt".format(i=i)) # this prints the value that I want
to_assign.append(data_{i}.txt".format(i=i))
##outside of the loop - only once - to all dataframe rows
df1['file'] = to_assign.
As a thought, pandas has a great API for performing these type of actions without for loops.
You should start practicing those.

how to determine if a cell has multiple values and count the number of occurences

I have a table as below where i need to count the number of times the type column has more than one value in it.
My logic at the moment is to go through each time and check if the type cell has more than one value in it and place a counter but i am not sure how to code this in Python correctly.
I tried this method below but i don't think it helps in my case considering that it is also hierarchical:
from collections import Counter
Counter(pd.DataFrame(data['Country'].str.split(',', expand=True)).values.ravel())
You can do:
## df is your data (gives pandas series)
df['type'].apply(lambda x: len(str(x).split(','))).value_counts()
## or convert it to dict
df['type'].apply(lambda x: len(str(x).split(','))).value_counts().to_dict()
Using get_dummies with sum
df=pd.DataFrame({'type':['big,green','big','small,red']})
df.type.str.get_dummies(sep=',').sum(1)
Out[382]:
0 2
1 1
2 2
dtype: int64
Maybe you should try this one:
df=pd.DataFrame({'type':['big,green','big','small,red']})
for i in df['type']: print(len(i.split(',')))

Using Python, Pandas and Apply/Lambda, how can I write a function which creates multiple new columns?

Apologies for the messy title: Problem as follows:
I have some data frame of the form:
df1 =
Entries
0 "A Level"
1 "GCSE"
2 "BSC"
I also have a data frame of the form:
df2 =
Secondary Undergrad
0 "A Level" "BSC"
1 "GCSE" "BA"
2 "AS Level" "MSc"
I have a function which searches each entry in df1, looking for the words in each column of df2. The words that match, are saved (Words_Present):
def word_search(df,group,words):
Yes, No = 0,0
Words_Present = []
for i in words:
match_object = re.search(i,df)
if match_object:
Words_Present.append(i)
Yes = 1
else:
No = 0
if Yes == 1:
Attribute = 1
return Attribute
I apply this function over all entries in df1, and all columns in df2, using the following iteration:
for i in df2:
terms = df2[i].values.tolist()
df1[i] = df1['course'][0:1].apply(lambda x: word_search(x,i,terms))
This yields an output df which looks something like:
df1 =
Entries Secondary undergrad
0 "A Level" 1 0
1 "GCSE" 1 0
2 "AS Level" 1 0
I want to amend the Word_Search function to output a the Words_Present list as well as the Attribute, and input these into a new column, so that my eventual df1 array looks like:
Desired dataframe:
Entries Secondary Words Found undergrad Words Found
0 "A Level" 1 "A Level" 0
1 "GCSE" 1 "GCSE" 0
2 "AS Level" 1 "AS Level" 0
If I do:
def word_search(df,group,words):
Yes, No = 0,0
Words_Present = []
for i in words:
match_object = re.search(i,df)
if match_object:
Words_Present.append(i)
Yes = 1
else:
No = 0
if Yes == 1:
Attribute = 1
if Yes == 0:
Attribute = 0
return Attribute,Words_Present
My function therefore now has multiple outputs. So applying the following:
for i in df2:
terms = df2[i].values.tolist()
df1[i] = df1['course'][0:1].apply(lambda x: word_search(x,i,terms))
My Output Looks like this:
Entries Secondary undergrad
0 "A Level" [1,"A Level"] 0
1 "GCSE" [1, "GCSE"] 0
2 "AS Level" [1, "AS Level"] 0
The output of pd.apply() is always a pandas series, so it just shoves everything into the single cell of df[i] where i = secondary.
Is it possible to split the output of .apply into two separate columns, as shown in the desired dataframe?
I have consulted many questions, but none seem to deal directly with yielding multiple columns when the function contained within the apply statement has multiple outputs:
Applying function with multiple arguments to create a new pandas column
Create multiple columns in Pandas Dataframe from one function
Apply pandas function to column to create multiple new columns?
For example, I have also tried:
for i in df2:
terms = df2[i].values.tolist()
[df1[i],df1[i]+"Present"] = pd.concat([df1['course'][0:1].apply(lambda x: word_search(x,i,terms))])
but this simply yields errors such as:
raise ValueError('Length of values does not match length of ' 'index')
Is there a way to use apply, but still extract the extra information directly into multiple columns?
Many thanks, apologies for the length.
The direct answer to your question is yes: use the apply method of the DataFrame object, so you'd be doing df1.apply().
However, for this problem, and anything in pandas in general, try to vectorise rather than iterate through rows -- it's faster and cleaner.
It looks like you are trying to classify Entries into Secondary or Undergrad, and saving the keyword used to make the match. If you assume that each element of Entries has no more than one keyword match (i.e. you won't run into 'GCSE A Level'), you can do the following:
df = df1.copy()
df['secondary_words_found'] = df.Entries.str.extract('(A Level|GCSE|AS Level)')
df['undergrad_words_found'] = df.Entries.str.extract('(BSC|BA|MSc)')
df['secondary'] = df.secondary_words_found.notnull() * 1
df['undergrad'] = df.undergrad_words_found.notnull() * 1
EDIT:
In response to your issue with having many more categories and keywords, you can continue in the spirit of this solution by using an appropriate for loop and doing '(' + '|'.join(df2['Undergrad'].values) + ')' inside the extract method.
However, if you have exact matches, you can do everything by a combination of pivots and joins:
keywords = df2.stack().to_frame('Entries').reset_index().drop('level_0', axis = 1).rename(columns={'level_1':'category'})
df = df1.merge(keywords, how = 'left')
for colname in df.category:
df[colname] = (df.Entries == colname) * 1 # Your indicator variable
df.loc[df.category == colname, colname + '_words_found'] = df.loc[df.category == colname, 'Entries']
The first line 'pivots' your table of keywords into a 2-column dataframe of keywords and categories. Your keyword column must be the same as the column in df1; in SQL, this would be called the foreign key that you are going to join these tables on.
Also, you generally want to avoid having duplicate indexes or columns, which in your case, was Words Found in the desired dataframe!
For the sake of completeness, if you insisted on using the apply method, you would iterate over each row of the DataFrame; your code would look something like this:
secondary_words = df2.Secondary.values
undergrad_words = df2.Undergrad.values
def(s):
if s.Entries.isin(secondary_words):
return pd.Series({'Entries':s.Entries, 'Secondary':1, 'secondary_words_found':s.Entries, 'Undergrad':0, 'undergrad_words_found':''})
elif s.Entries.isin(undergrad_words ):
return pd.Series({'Entries':s.Entries, 'Secondary':0, 'secondary_words_found':'', 'Undergrad':1, 'undergrad_words_found':s.Entries})
else:
return pd.Series({'Entries':s.Entries, 'Secondary':0, 'secondary_words_found':'', 'Undergrad':0, 'undergrad_words_found':''})
This second version will only work in the cases you want it to if the element in Entries is exactly the same as its corresponding element in df2. You certainly don't want to do this, as it's messier, and will be noticeably slower if you have a lot of data to work with.

Categories

Resources