Manipulating DataFrames in a function - python

I need to make a function that can act on any dataframe and perform an action on it.
To clarify, for example, let's say I have this sample dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])
Which looks like this:
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
I have created a function that does something of this sort:
def ColDrop(df, collist):
    df = df.drop(columns=collist)
    return df
What I'd like is for it to accept a list as the 'collist' argument and drop all of those columns from the dataframe passed as 'df', so...
col = ['a', 'b']
ColDrop(df, col)
Would look like...
   c
0  3
1  6
2  9
However, it doesn't seem to work. Similarly I want to remove values from any dataframe based on its row, for example...
def rowvaluedrop(df, column, pattern):
    filter = df[column].str.contains(pattern)
    df = df[~filter]
    return df
rowvaluedrop(df, 'a', 4)
Would look like...
   a  b  c
0  1  2  3
2  7  8  9
(I realise this second example may not work since the values are integers rather than strings, but I hope my point gets across regardless.)
Thanks in advance.

You need to assign the returned dataframe back to df explicitly:
df = rowvaluedrop(df, 'a', 4)
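For illustration, here's a minimal runnable sketch of both functions with reassignment (one assumption on my part: casting the column to str so that str.contains also works on the integer sample data):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])

def col_drop(df, collist):
    # drop() returns a new dataframe; the caller must keep the result
    return df.drop(columns=collist)

def row_value_drop(df, column, pattern):
    # cast to str so str.contains also works on integer columns (assumption)
    mask = df[column].astype(str).str.contains(str(pattern))
    return df[~mask]

print(col_drop(df, ['a', 'b']))    # columns 'a' and 'b' removed
print(row_value_drop(df, 'a', 4))  # rows where column 'a' contains "4" removed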

Related

Extract rows with repeats, from dataframe where column value matches value from an array

I have a pandas.DataFrame df with one of the column headers being 'X'. Let's say it is of size (N, M); N=3, M=2 in this example:
   X  Y
0  1  a
1  2  b
2  3  c
I have a 1D numpy array arr of size (Q,) that contains values, some of which are repeats. Q=5 in this example:
array([1, 2, 3, 2, 2])
I would like to create a new pandas.DataFrame df_op that contains rows from df, where each row's X matches an entry from arr. This means some rows are extracted more than once, and the resulting df_op has size (Q, M). If possible, I would like to keep the same order of entries as in arr as well.
   X  Y
0  1  a
1  2  b
2  3  c
3  2  b
4  2  b
Using the usual boolean indexing does not work, because that only picks up unique rows. I would also like to avoid loops if possible, because Q is large.
How can I get df_op? Thank you.
Use indexing to fetch the same row multiple times:
import pandas as pd

x = [1, 2, 3, 2, 2]
df = pd.DataFrame({'X': [1, 2, 3], 'Y': ['a', 'b', 'c']})
out = df.set_index('X').loc[x].reset_index()
Output:
>>> out
   X  Y
0  1  a
1  2  b
2  3  c
3  2  b
4  2  b
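A possible alternative sketch using merge; a left merge onto a frame built from arr also preserves arr's order (same sample data as above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3], 'Y': ['a', 'b', 'c']})
arr = np.array([1, 2, 3, 2, 2])

# build a one-column frame from arr and left-merge df onto it;
# the result keeps arr's length and ordering
out = pd.DataFrame({'X': arr}).merge(df, on='X', how='left')
print(out)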

Function Value with Combination (or Permutation) of Variables and Assign to Dataframe

I have n variables; suppose n equals 3 in this case. I want to apply one function to all of the combinations (or permutations, depending on how you want to solve this) of the variables and store the result in the corresponding row and column of a dataframe.
import numpy as np
import pandas as pd

a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']
df = pd.DataFrame({x: np.nan for x in indexes}, index=indexes)
If I apply sum (the function can be anything), then the result I want to get is this:
   a  b  c
a  2  3  4
b  3  4  5
c  4  5  6
I can only think of iterating all the variables, apply the function one by one, and use the index of the iterators to set the value in the dataframe. Is there any better solution?
You can use apply and return a pd.Series to that effect. In such cases, pandas uses the Series' indices as columns in the resulting dataframe.
s = pd.Series({"a": 1, "b": 2, "c": 3})
s.apply(lambda x: x + s)
Just note that the operation you do is between an element and a series.
If performance is important, I believe you need a broadcast sum of an array created from the variables:
a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']

arr = np.array([a, b, c])
df = pd.DataFrame(arr + arr[:, None], index=indexes, columns=indexes)
print(df)
   a  b  c
a  2  3  4
b  3  4  5
c  4  5  6
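Since the function can be anything, here is a hedged sketch generalising the broadcast idea with np.ufunc.outer, under the assumption that the function is a binary NumPy ufunc such as np.add or np.multiply:

import numpy as np
import pandas as pd

a, b, c = 1, 2, 3
indexes = ['a', 'b', 'c']
arr = np.array([a, b, c])

# outer() applies the ufunc to every pair of elements;
# np.add.outer reproduces the sum table above
df = pd.DataFrame(np.add.outer(arr, arr), index=indexes, columns=indexes)
print(df)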

A straightforward method to select columns by position

I am trying to clarify how to use pandas methods to select columns and rows in a DataFrame. An example will clarify my issue:
dic = {'a': [1, 5, 2, 7], 'b': [6, 8, 4, 2], 'c': [5, 3, 2, 7]}
df = pd.DataFrame(dic, index=['e', 'f', 'g', 'h'])
then
df =
   a  b  c
e  1  6  5
f  5  8  3
g  2  4  2
h  7  2  7
Now if I want to select column 'a' I just have to type
df['a']
while if I want to select row 'e' I have to use the ".loc" method
df.loc['e']
If I don't know the name of the row, but just its position (0 in this case), then I can use the "iloc" method
df.iloc[0]
What seems to be missing is a method for calling columns by position rather than by name, something like an "equivalent for columns of the 'iloc' method for rows". The only way I can find to do this is
df[df.keys()[0]]
is there something like
df.ilocColumn[0]
?
You can add : because in iloc the first argument is the position of the selected rows and the second is the position of the columns.
A : means all rows in the DataFrame:
print(df.iloc[:, 0])

e    1
f    5
g    2
h    7
Name: a, dtype: int64
If you need to select the first index and first column value:
print(df.iloc[0, 0])
1
A solution with ix works nicely if you need to select the index by name and the column by position, but note that ix is deprecated (and was removed in pandas 1.0):
print(df.ix['e', 0])
1
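A hedged modern equivalent for current pandas, where ix is no longer available:

import pandas as pd

dic = {'a': [1, 5, 2, 7], 'b': [6, 8, 4, 2], 'c': [5, 3, 2, 7]}
df = pd.DataFrame(dic, index=['e', 'f', 'g', 'h'])

# row by label, column by position, without the deprecated ix
print(df.loc['e', df.columns[0]])          # -> 1
print(df.iloc[df.index.get_loc('e'), 0])   # equivalent, fully positional row lookup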

How to add column labels to a pandas DataFrame

I can't understand how to add column names to a pandas dataframe; an easy example will clarify my issue:
dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)
now if I type df then I get
   a  b  c
0  4  4  5
1  1  2  7
2  3  1  9
3  1  4  1
say now that I generate another dataframe just by summing up the columns of the previous one
a = df.sum()
if I type 'a' then I get
a     9
b    11
c    22
That looks like a dataframe with an index but without a name on its only column. So I wrote
a.columns = ['column']
or
a.columns = ['index', 'column']
and in both cases Python was happy, because it didn't give me any error message. But still, if I type 'a', I can't see the column name anywhere. What's wrong here?
The method DataFrame.sum() does an aggregation and therefore returns a Series, not a DataFrame. And a Series has no columns, only an index. If you want to create a DataFrame out of your sum, you can change a = df.sum() to:
a = pandas.DataFrame(df.sum(), columns=['whatever_name_you_want'])
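A possible alternative sketch: a Series carries a .name attribute, and to_frame() promotes it to a single-column DataFrame (the name 'column' below is just the one the question tried to set):

import pandas as pd

dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)

# name the Series, then promote it to a one-column DataFrame
a = df.sum().rename('column').to_frame()
print(a)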

Merging and summing up several value_counts series in pandas

I usually use value_counts() to get the number of occurrences of a value. However, I now deal with large database tables (which cannot be loaded fully into RAM) and query the data in fractions of one month.
Is there a way to store the result of value_counts() and merge it with / add it to the next results?
I want to count the number of user actions. Assume the following structure of user-activity logs:
# month 1
id  userId  actionType
 1       1           a
 2       1           c
 3       2           a
 4       3           a
 5       3           b

# month 2
id  userId  actionType
 6       1           b
 7       1           b
 8       2           a
 9       3           c
Using value_counts() on those produces:
# month 1
userId
1    2
2    1
3    2

# month 2
userId
1    2
2    1
3    1
Expected output:
# month 1+2
userId
1    4
2    2
3    3
Up until now, I have only found a method using groupby and sum:
# count user actions and remember them in a new column
df1['count'] = df1.groupby(['userId'], sort=False)['id'].transform('count')
# drop the columns that are no longer needed
df1 = df1[['userId', 'count']]
# drop the rows that are no longer needed
df1 = df1.drop_duplicates(subset=['userId'])
# repeat for the second frame
df2['count'] = df2.groupby(['userId'], sort=False)['id'].transform('count')
df2 = df2[['userId', 'count']]
df2 = df2.drop_duplicates(subset=['userId'])
# merge and sum up
print(pd.concat([df1, df2]).groupby(['userId'], sort=False).sum())
What is the pythonic / pandas' way of merging the information of several series' (and dataframes) efficiently?
Let me suggest add with a fill value of 0. This has an advantage over the previously suggested answer in that it will work even when the two DataFrames have non-identical sets of unique keys.
# create frames
df1 = pd.DataFrame(
    {'User_id': ['a', 'a', 'b', 'c', 'c', 'd'], 'a': [1, 1, 2, 3, 3, 5]})
df2 = pd.DataFrame(
    {'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
Now add the two sets of value_counts(). The fill_value argument handles any NaN values that would arise; in this example, the 'd' appears in df1 but not in df2.
a = df1.User_id.value_counts()
b = df2.User_id.value_counts()
a.add(b, fill_value=0)
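Since the original question accumulates results month by month, here is a hedged sketch folding an arbitrary list of monthly value_counts() series together (the counts list below is a hypothetical stand-in for your monthly queries, reusing df1 and df2 from above):

from functools import reduce

# hypothetical list holding one value_counts() Series per month
counts = [df1.User_id.value_counts(), df2.User_id.value_counts()]

# fold the series together, filling missing keys with 0;
# cast back to int because fill_value promotes the dtype to float
total = reduce(lambda s1, s2: s1.add(s2, fill_value=0), counts).astype(int)
print(total)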
You can sum the series generated by the value_counts method directly:
# create frames
df = pd.DataFrame({'User_id': ['a', 'a', 'b', 'c', 'c'], 'a': [1, 1, 2, 3, 3]})
df1 = pd.DataFrame({'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
sum the series:
df.User_id.value_counts() + df1.User_id.value_counts()
output:
a    4
b    3
c    5
dtype: int64
This is known as "split-apply-combine". It is done in one line, using a lambda function as follows.
1️⃣ Paste this into your code:
df['total_for_this_label'] = df.groupby('label', as_index=False)['label'].transform(lambda x: x.count())
2️⃣ Replace the three occurrences of label with the name of the column whose values you are counting (case-sensitive).
3️⃣ Run print(df.head()) to check it worked correctly.
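As a side note, a hedged sketch of the same idea without a lambda, using the built-in 'size' aggregation (the 'label' column here is hypothetical, mirroring the step above):

import pandas as pd

# hypothetical frame with a 'label' column
df = pd.DataFrame({'label': ['a', 'a', 'b']})

# 'size' counts rows per group, so no lambda is needed
df['total_for_this_label'] = df.groupby('label')['label'].transform('size')
print(df)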
