Let's say I have a dataframe formatted the following way:
| id | name | 052017 | 062017 | 072017 | 092017 | 102017 |
|----|------|--------|--------|--------|--------|--------|
| 20 | abcd | 0      | 100    | 200    | 50     | 0      |
I need to retrieve the column name of the last month an organization had any transactions. In this case, I would like to add a column called "date_string" that would have 092017 as its contents.
Any way to achieve this?
Thanks!
Replace 0 with np.nan, then use last_valid_index:
df.replace(0, np.nan).apply(lambda x: x.last_valid_index(), axis=1)
Out[602]:
0 092017
dtype: object
df['date_string'] = df.replace(0, np.nan).apply(lambda x: x.last_valid_index(), axis=1)
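A self-contained sketch of this approach, with the sample row reconstructed from the question; restricting the lookup to the month columns keeps the non-numeric id/name columns from ever being picked:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [20], "name": ["abcd"],
    "052017": [0], "062017": [100], "072017": [200],
    "092017": [50], "102017": [0],
})

# Treat 0 as "no transactions", then take each row's last non-NaN column name
months = df.drop(columns=["id", "name"])
df["date_string"] = months.replace(0, np.nan).apply(
    lambda row: row.last_valid_index(), axis=1
)
print(df["date_string"].iloc[0])  # 092017
```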
I have a dataframe (df):
| A | B | C |
| --- | ----- | ----------------------- |
| CA | Jon | [sales, engineering] |
| NY | Sarah | [engineering, IT] |
| VA | Vox | [services, engineering] |
I am trying to group by each item in the C column list (sales, engineering, IT, etc.).
Tried:
df.groupby('C')
but got TypeError: unhashable type: 'list', which is expected. I came across another post recommending converting the C column to tuples, which are hashable, but I need to group by each individual item, not the combination.
My goal is to get the count of each row in the df for each item in the C column list. So:
sales: 1
engineering: 3
IT: 1
services: 1
While there is probably a simpler way to obtain this than using groupby, I am still curious if groupby can be used in this case.
You can use explode and value_counts:
out = df.explode("C").value_counts("C")
Output:
print(out)
C
engineering 3
IT 1
sales 1
services 1
dtype: int64
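A runnable sketch, with the sample frame rebuilt from the question's table:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["CA", "NY", "VA"],
    "B": ["Jon", "Sarah", "Vox"],
    "C": [["sales", "engineering"],
          ["engineering", "IT"],
          ["services", "engineering"]],
})

# explode turns each list element into its own row,
# then value_counts tallies occurrences of each item
out = df.explode("C").value_counts("C")
print(out)
```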
I have a dataframe like this (simplified):
| | amount | other_amt | rule_id |
|---:|:--------|:----------|---------:|
| 0 | 2 | 0 | 101 |
| 1 | 20 | 0.5 | 102 |
| 2 | 300 | 0 | 0 |
| 3 | 50 | 1 | 101 |
I then have a set of functions that apply each of these rules to the data, such as:
def rule_101(df):
    return df['amount'] / 2

def rule_102(df):
    return df['other_amt']
I want to create a new column where I apply each rule_xxx(df) function, depending on what's in the rule_id column. And I use the content of the rule_id column to call the function within the command that creates the new column. Something like
df['new_col'] = np.where(df['rule_id'] == '0',
                         df['amount'],
                         locals()[f'rule_{df.rule_id}'](df))
This bit, f'rule_{df.rule_id}', is what's causing me trouble: it interpolates the entire Series rather than a single row's value, which raises an error like
KeyError: 'rule_0 0\n1 0\n2 0\n3 0\n4 0\n ..\n495 0\n496 0\n497 0\n498 0\n499 0\nName: rule_id, Length: 500, dtype: object'
How can I "align" these two inputs? So that the value in rule_id for each row gets inserted in the f-string, thus calling the function for that specific rule_id on that specific row?
Other approaches are also welcome of course, as long as I'm able to apply the function corresponding to the rule_id in each row. Thanks a lot
You can use a dictionary to look up the rules:
def rule_101(row):
    return row['amount'] / 2

def rule_102(row):
    return row['other_amt']

ruleset = {
    0: lambda row: row['amount'],  # rule 0: keep the amount unchanged
    101: rule_101,
    102: rule_102,
}

def rules(row):
    return ruleset[row['rule_id']](row)

df['new_col'] = df.apply(rules, axis=1)
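Put together with the sample data from the question (and mapping rule 0 to the raw amount, as the question's np.where attempt does), the dispatch looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [2, 20, 300, 50],
    "other_amt": [0, 0.5, 0, 1],
    "rule_id": [101, 102, 0, 101],
})

def rule_101(row):
    return row["amount"] / 2

def rule_102(row):
    return row["other_amt"]

ruleset = {
    0: lambda row: row["amount"],  # rule 0: keep the amount unchanged
    101: rule_101,
    102: rule_102,
}

# Look up the function for each row's rule_id and apply it to that row
df["new_col"] = df.apply(lambda row: ruleset[row["rule_id"]](row), axis=1)
print(df["new_col"].tolist())  # [1.0, 0.5, 300.0, 25.0]
```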
I want to get all the possible combinations of size 2 of a column in pyspark dataframe.
My pyspark dataframe looks like
| id |
|----|
| 1  |
| 2  |
| 3  |
| 4  |
For above input, I want to get output as
| id1 | id2 |
|-----|-----|
| 1   | 2   |
| 1   | 3   |
| 1   | 4   |
| 2   | 3   |
and so on..
One way would be to collect the values into a Python iterable (a list or pandas DataFrame) and use itertools.combinations to generate all combinations.
import itertools
from pyspark.sql import functions as F

values = df.select(F.collect_list('id')).first()[0]
combns = list(itertools.combinations(values, 2))
However, I want to avoid collecting the dataframe column to the driver since the rows can be extremely large. Is there a better way to achieve this using spark APIs?
You can use the crossJoin method and then keep only the rows with id1 < id2 (which also drops the self-pairs):
df = df.toDF('id1').crossJoin(df.toDF('id2')).filter('id1 < id2')
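The id1 < id2 filter keeps exactly one copy of each unordered pair. A quick pure-Python check of that equivalence (no Spark needed), using the four ids from the question:

```python
from itertools import combinations, product

ids = [1, 2, 3, 4]

# crossJoin yields every ordered pair; keeping id1 < id2 leaves
# each unordered pair exactly once, matching itertools.combinations
cross = sorted((a, b) for a, b in product(ids, repeat=2) if a < b)

assert cross == list(combinations(ids, 2))
```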
Suppose the data frame below:
|id |day | order |
|---|--- |-------|
| a | 2 | 6 |
| a | 4 | 0 |
| a | 7 | 4 |
| a | 8 | 8 |
| b | 11 | 10 |
| b | 15 | 15 |
I want to apply a function to day and order column of each group by rows on id column.
The function is:
def mean_of_differences(my_list):
    return sum([my_list[i] - my_list[i-1] for i in range(1, len(my_list))]) / len(my_list)
This function calculates the mean of the differences between each element and the next. For example, for id=a, day would be (2+3+1)/4 = 1.5. I know how to use a lambda, but I didn't find a way to implement this in a pandas groupby. Also, each column should be sorted independently to get my desired output, so it is apparently not possible to sort by one column before the groupby.
The output should be like this:
|id |day| order |
|---|---|-------|
| a |1.5| 2 |
| b | 2 | 2.5 |
Any one know how to do so in a group by?
First sort your data by day, then group by id, and finally compute the diff/mean:
df = df.sort_values('day') \
.groupby('id') \
.agg({'day': lambda x: x.diff().fillna(0).mean()}) \
.reset_index()
Output:
>>> df
id day
0 a 1.5
1 b 2.0
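The snippet above only aggregates day. Since the question wants each column sorted independently (for id=a, order sorted is 0, 4, 6, 8, giving (4+2+2)/4 = 2), one possible extension, sketched under that assumption, is to sort inside the aggregation itself:

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "a", "a", "a", "b", "b"],
    "day": [2, 4, 7, 8, 11, 15],
    "order": [6, 0, 4, 8, 10, 15],
})

# Sort each column's values within the group before differencing,
# so day and order are handled independently of each other
def mean_of_sorted_diffs(x):
    return x.sort_values().diff().fillna(0).mean()

out = df.groupby("id").agg({"day": mean_of_sorted_diffs,
                            "order": mean_of_sorted_diffs})
print(out)  # a: day 1.5, order 2.0; b: day 2.0, order 2.5
```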
I'm sure what I'm trying to do is fairly simple for those with better knowledge of pandas, but I'm simply stuck at transforming:
+---------+------------+-------+
| Trigger | Date | Value |
+---------+------------+-------+
| 1 | 01/01/2016 | a |
+---------+------------+-------+
| 2 | 01/01/2016 | b |
+---------+------------+-------+
| 3 | 01/01/2016 | c |
+---------+------------+-------+
...etc, into:
+------------+---------------------+---------+---------+---------+
| Date | #of triggers | count a | count b | count c |
+------------+---------------------+---------+---------+---------+
| 01/01/2016 | 3 | 1 | 1 | 1 |
+------------+---------------------+---------+---------+---------+
| 02/01/2016 | 5 | 2 | 1 | 2 |
+------------+---------------------+---------+---------+---------+
... and so on
The issue is, I've got no bloody idea of how to achieve this...
I've scoured SO, but I can't seem to find anything that applies to my specific case.
I presume I'd have to group it all by date, but then once that is done, what do I need to do to get the remaining columns?
The initial DF is loaded from an SQL Alchemy query object, and then I want to manipulate it to get the result I described above. How would one do this?
Thanks
df.groupby(['Date','Value']).count().unstack(level=-1)
You can use GroupBy.size with unstack; the parameter sort=False is also helpful:
df1 = df.groupby(['Date','Value'])['Value'].size().unstack(fill_value=0)
df1['Total'] = df1.sum(axis=1)
cols = df1.columns[-1:].union(df1.columns[:-1])
df1 = df1[cols]
print (df1)
Value Total a b c
Date
01/01/2016 3 1 1 1
The difference between size and count is:
size counts NaN values, count does not.
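End to end with the three sample rows from the question (the Total column here plays the role of "# of triggers", and the column reorder is written out explicitly rather than via the Index.union trick):

```python
import pandas as pd

df = pd.DataFrame({
    "Trigger": [1, 2, 3],
    "Date": ["01/01/2016"] * 3,
    "Value": ["a", "b", "c"],
})

# Count occurrences per (Date, Value), then pivot Value into columns
df1 = df.groupby(["Date", "Value"])["Value"].size().unstack(fill_value=0)
df1["Total"] = df1.sum(axis=1)

# Move Total to the front
df1 = df1[["Total"] + [c for c in df1.columns if c != "Total"]]
print(df1)  # 01/01/2016: Total 3, a/b/c each 1
```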