pandas groupby filter nunique - python

So far an example of what I have is here:
df = pd.DataFrame({"barcode": [1,2,2,3,3,4, 4, 4], "date": ['today', 'today', 'tomorrow', 'tomorrow', 'tomorrow', 'yesterday', 'yesterday' ,'yesterday'], "info": [40,20,10,15,17,19, 21, 23]})
gb= df.groupby(['date'])
gb.filter(lambda x: x['barcode'].nunique!=1)
which returns:
Empty DataFrame
Columns: [barcode, date, info]
Index: []
Only "yesterday" should remain after I filter this because there are 2 distinct barcodes in the group "today", and 2 distinct barcodes in the group "tomorrow". What is going on here? and in the example the column to filter on is sorted but does it need to be?

I would recommend:
gb = df.groupby(['date'])
df = df[gb['barcode'].transform('nunique').eq(1)]

nunique is a method, not a property, so x['barcode'].nunique is just a bound method object and the comparison never counts anything. Also, to keep the groups with exactly one distinct barcode, the condition should be == 1. Fix:
gb.filter(lambda x: x['barcode'].nunique() == 1)
The grouping column does not need to be sorted, by the way; groupby collects all rows with the same key wherever they appear.
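A minimal end-to-end sketch showing both the filter fix and the transform alternative (using the sample data from the question):

import pandas as pd

df = pd.DataFrame({"barcode": [1, 2, 2, 3, 3, 4, 4, 4],
                   "date": ['today', 'today', 'tomorrow', 'tomorrow',
                            'tomorrow', 'yesterday', 'yesterday', 'yesterday'],
                   "info": [40, 20, 10, 15, 17, 19, 21, 23]})

# filter drops whole groups whose distinct-barcode count is not exactly 1
kept = df.groupby('date').filter(lambda x: x['barcode'].nunique() == 1)

# transform builds a per-row boolean mask with the same effect
mask = df.groupby('date')['barcode'].transform('nunique').eq(1)

print(kept.equals(df[mask]))  # True; only the 'yesterday' rows survive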

Related

PySpark: Create a subset of a dataframe for all dates

I have a DataFrame that has a lot of columns and I need to create a subset of that DataFrame that has only date values.
For e.g. my Dataframe could be:
1, 'John Smith', '12/10/1982', '123 Main St', '01/01/2000'
2, 'Jane Smith', '11/21/1999', 'Abc St', '12/12/2020'
And my new DataFrame should only have:
'12/10/1982', '01/01/2000'
'11/21/1999', '12/12/2020'
The dates could be of any format and could be on any column. I can use the dateutil.parser to parse them to make sure they are dates. But not sure how to call parse() on all the columns and only filter those that return true to another dataframe, easily.
If you know which columns the datetimes are in, it's easy:
df2 = df[["col_name_1", "col_name_2"]]
# or
df2 = df.iloc[:, [2, 4]]
You can find your columns' datatype by checking each tuple in your_dataframe.dtypes.
schema = "id int, name string, date timestamp, date2 timestamp"
df = spark.createDataFrame([(1, "John", datetime.now(), datetime.today())], schema)
list_of_columns = []
for (field_name, data_type) in df.dtypes:
if data_type == "timestamp":
list_of_columns.append(field_name)
Now you can use this list inside .select()
df_subset_only_timestamps = df.select(list_of_columns)
EDIT: I realized your date columns might be StringType.
You could try something like this (col and when come from pyspark.sql.functions):
from pyspark.sql.functions import col, when

df_subset_only_timestamps = df.select([when(col(column).like("%/%/%"), col(column)).alias(column) for column in df.columns]).na.drop()
Inspired by this answer. Let me know if it works!
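Since the question also mentions dateutil, here is a rough pandas-side sketch of that idea (the helper name date_like_columns and the separator heuristic are my own, not from the answers above): sample each column, try dateutil.parser.parse, and keep only the columns where every sampled value parses.

import pandas as pd
from dateutil import parser

def date_like_columns(df, sample_size=20):
    """Return the names of columns whose sampled values all parse as dates."""
    keep = []
    for col_name in df.columns:
        sample = df[col_name].dropna().astype(str).head(sample_size)
        if sample.empty:
            continue
        try:
            for value in sample:
                # dateutil is lenient (bare numbers parse too), so
                # require an explicit date separator first
                if not any(sep in value for sep in "/-"):
                    raise ValueError(value)
                parser.parse(value)
        except (ValueError, OverflowError):
            continue
        keep.append(col_name)
    return keep

df = pd.DataFrame([(1, 'John Smith', '12/10/1982', '123 Main St', '01/01/2000'),
                   (2, 'Jane Smith', '11/21/1999', 'Abc St', '12/12/2020')])
print(df[date_like_columns(df)])  # keeps only the two date columns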

About Pandas Dataframe

I have a question related to Pandas.
In df1 I have a data frame with the id of each seller and their respective names.
In df2 I have the id of the salesmen and their respective sales.
I would like to add two new columns to df2 with the first and last names of the salesmen.
P.S. In df2, one of the sales is shared between two vendors.
import pandas as pd
vendors = {'first_name': ['Montgomery', 'Dagmar', 'Reeba', 'Shalom', 'Broddy', 'Aurelia'],
           'last_name': ['Humes', 'Elstow', 'Wattisham', 'Alen', 'Keningham', 'Brechin'],
           'id_vendor': [127, 241, 329, 333, 212, 233]}
sales = {'id_vendor': [['127'], ['241'], ['329, 333'], ['212'], ['233']],
         'sales': [1233, 25000, 8555, 4333, 3222]}
df1 = pd.DataFrame(vendors)
df2 = pd.DataFrame(sales)
I attach the code. Any suggestions?
Thank you in advance.
You can merge df1 with the exploded id_vendor column of df2, then group by sales and use GroupBy.agg to obtain the columns you want:
transform_names = lambda x: ', '.join(list(x))
res = (df1.merge(df2.explode('id_vendor'))
          .groupby('sales')
          .agg({'first_name': transform_names, 'last_name': transform_names,
                'id_vendor': list})
      )
print(res)
         first_name        last_name   id_vendor
sales
1233     Montgomery            Humes       [127]
3222        Aurelia          Brechin       [233]
4333         Broddy        Keningham       [212]
8555  Reeba, Shalom  Wattisham, Alen  [329, 333]
25000        Dagmar           Elstow       [241]
Note:
In your example, id_vendor in df2 is populated by lists of strings, but since id_vendor in df1 is of integer type, I assume that was a typo. If id_vendor does contain lists of strings, you also need to convert them to integers:
transform_names = lambda x: ', '.join(list(x))
# Notice the .astype(int) call.
res = (df1.merge(df2.explode('id_vendor').astype(int))
          .groupby('sales')
          .agg({'first_name': transform_names, 'last_name': transform_names,
                'id_vendor': list})
      )
print(res)
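For intuition, explode is what turns each list entry into its own row before the merge; on the (typo-corrected) data with id_vendor lists like ['329', '333'], it gives:

print(df2.explode('id_vendor'))
#   id_vendor  sales
# 0       127   1233
# 1       241  25000
# 2       329   8555
# 2       333   8555
# 3       212   4333
# 4       233   3222

Note how the original row index repeats for the shared sale, which is what lets the merge attach both vendors to it.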

Get column names with corresponding index in python pandas

I have this dataframe df where
>>> df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
...                    'Event':['Music', 'Poetry', 'Theatre', 'Comedy'],
...                    'Cost':[10000, 5000, 15000, 2000],
...                    'Name':['Roy', 'Abraham', 'Blythe', 'Sophia'],
...                    'Age':['20', '10', '13', '17']})
I want to determine the column index with the corresponding name. I tried it with this:
>>> list(df.columns)
But the solution above only returns the column names without index numbers.
How can I code it so that it would return the column names and the corresponding index for that column? Like This:
0 Date
1 Event
2 Cost
3 Name
4 Age
Simplest is to add the pd.Series constructor:
pd.Series(list(df.columns))
Or convert columns to Series and create default index:
df.columns.to_series().reset_index(drop=True)
Or pass the Index directly:
pd.Series(df.columns)
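Either way, the result lines up with the desired output:

>>> pd.Series(df.columns)
0     Date
1    Event
2     Cost
3     Name
4      Age
dtype: object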
You can use a loop like this:
myList = list(df.columns)
index = 0
for value in myList:
    print(index, value)
    index += 1
A nice short way to get a dictionary:
d = dict(enumerate(df))
output: {0: 'Date', 1: 'Event', 2: 'Cost', 3: 'Name', 4: 'Age'}
For a Series, pd.Series(list(df)) is sufficient, as iterating over a DataFrame yields the column names directly.
In addition to using enumerate, you can also get the numbers in order using zip, as follows:
import pandas as pd
df = pd.DataFrame({'Date':['10/2/2011', '11/2/2011', '12/2/2011', '13/2/11'],
                   'Event':['Music', 'Poetry', 'Theatre', 'Comedy'],
                   'Cost':[10000, 5000, 15000, 2000],
                   'Name':['Roy', 'Abraham', 'Blythe', 'Sophia'],
                   'Age':['20', '10', '13', '17']})
result = list(zip(range(len(df.columns)), df.columns.values))
for r in result:
    print(r)
#(0, 'Date')
#(1, 'Event')
#(2, 'Cost')
#(3, 'Name')
#(4, 'Age')
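For what it's worth, enumerate alone produces the same pairs with less machinery:
result = list(enumerate(df.columns))  # [(0, 'Date'), (1, 'Event'), ...]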

Row wise operation in Pandas DataFrame

I have a Dataframe as
import pandas as pd
df = pd.DataFrame({
    "First": ['First1', 'First2', 'First3'],
    "Secnd": ['Secnd1', 'Secnd2', 'Secnd3']
})
df.index = ['Row1', 'Row2', 'Row3']
I would like to use a lambda function in the apply method to create a list of dictionaries (including the index item) as below:
[
    {'Row1': ['First1', 'Secnd1']},
    {'Row2': ['First2', 'Secnd2']},
    {'Row3': ['First3', 'Secnd3']},
]
If I use something like .apply(lambda x: <some operation>) here, x contains only the values, not the index.
Cheers,
DD
To expand Hans Bambel's answer to get the exact desired output:
[{k: list(v.values())} for k, v in df.to_dict('index').items()]
You don't need apply here. You can just use the to_dict() function with the "index" argument:
df.to_dict("index")
This gives the output:
{'Row1': {'First': 'First1', 'Secnd': 'Secnd1'},
 'Row2': {'First': 'First2', 'Secnd': 'Secnd2'},
 'Row3': {'First': 'First3', 'Secnd': 'Secnd3'}}
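If you do want to stay with apply, the row label is exposed as x.name when applying across columns; a small sketch (using the df from the question):

out = df.apply(lambda x: {x.name: list(x)}, axis=1).tolist()
print(out)
# [{'Row1': ['First1', 'Secnd1']}, {'Row2': ['First2', 'Secnd2']}, {'Row3': ['First3', 'Secnd3']}]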

Selection of columns

I work with a Pandas dataframe. I want to aggregate data by one column and then sum the other columns. You can see an example below:
import pandas as pd

data = {'name': ['Company1', 'Company2', 'Company1', 'Company2', 'Company5'],
        'income': [0, 180395, 4543168, 7543168, 73],
        'turnover': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, columns=['name', 'income', 'turnover'])
df
INCOME_GROUPED = df.groupby(['name']).agg({'income':sum,'turnover':sum})
So the code above works well and gives a good result. The next step is selection: I want to select only two columns from the INCOME_GROUPED dataframe.
INCOME_SELECT = INCOME_GROUPED[['name','income']]
But after executing this line of code I got this error:
"None of [Index(['name', 'income'], dtype='object')] are in the [columns]"
Can anybody help me solve this problem?
You need to call reset_index() after agg():
INCOME_GROUPED = df.groupby(['name']).agg({'income':sum,'turnover':sum}).reset_index()
# ^^^^^^^^^^^^^^ add this
Output:
>>> INCOME_GROUPED[['name', 'income']]
       name   income
0  Company1  4543168
1  Company2  7723563
2  Company5       73
If your actual code reads
INCOME_SELECT = WAGE_GROUPED[['name','income']]
you probably meant INCOME_GROUPED instead of WAGE_GROUPED. Also, uppercase variable names are frowned upon unless they are global constants.
That alone only makes "income" available as a column, though; "name" is still the index. By default, groupby makes the grouper column the index of the result; this is changeable by passing as_index=False to it:
income_grouped = (df.groupby("name", as_index=False)
                    .agg({"income": "sum", "turnover": "sum"}))
income_selected = income_grouped[["name", "income"]].copy()
Note also the copy at the very end to avoid the infamous SettingWithCopyWarning in possible future manipulations.
Last note is that since you aggregated all columns and did it with the same method ("sum"), you can go for
income_grouped = df.groupby("name", as_index=False).sum()
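As a quick check, the shortcut produces the same aggregate on the sample data (turnover sums worked out by hand from the input):

print(df.groupby("name", as_index=False).sum())
#        name   income  turnover
# 0  Company1  4543168        35
# 1  Company2  7723563        26
# 2  Company5       73         3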
