extract unique values and make a new dataframe on a condition - python

Assume this is my sample input df:
       date  h_league
0  19901126        AA
1  19911127        NA
2  20030130        AA
3  20041217       NaN
4  20080716        AA
5  20011215        NA
6  19970603       NaN
I'm looking to extract the unique leagues from h_league and also add two new columns: max_date, holding the maximum date for each league, and min_date, holding the minimum date.
# Desired Output:
  h_league  Max_date  Min_date
0       AA  20080716  19901126
1       NA  20011215  19911127
I wrote a function for this task that returns output similar to what I want, but not the exact desired output.
def league_info(league):
    league_games = df[df["h_league"] == league]
    # use the filtered frame, not the full df, so min/max are per league
    earliest = league_games["date"].min()
    latest = league_games["date"].max()
    print("{} went from {} to {}".format(league, earliest, latest))

for league in df["h_league"].unique():
    league_info(league)
I'm looking for a pandas way to achieve my desired output. Any help is appreciated. Thank you!

IIUC:
df = df.fillna('NA')
df.groupby('h_league').date.agg(['max', 'min'])
Out[98]:
               max       min
h_league
AA        20080716  19901126
NA        20041217  19911127

df2 = df.fillna('NA')
df2.groupby('h_league').date.agg(['max', 'min'])
Does this work for you? You can assign df = df.fillna('NA') instead, too. Let me know if this works; I tried it.
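Note that fillna('NA') folds the rows whose league is missing into the 'NA' league, which is why the max for NA above is 20041217 rather than the 20011215 in the desired output. A minimal sketch that matches the desired output exactly, assuming the rows with a missing h_league should simply be excluded:
# drop rows with no league, then aggregate with named columns
out = (df.dropna(subset=['h_league'])
         .groupby('h_league')['date']
         .agg(Max_date='max', Min_date='min')   # named aggregation (pandas >= 0.25)
         .reset_index())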

Related

Divide DataFrame Column on (,) into two new columns

I have a pandas DataFrame called data_combined with the following structure:
          index  corr_year   corr_5d
0    (DAL, AAL)   0.873762  0.778594
1     (WEC, ED)   0.851578  0.850549
2    (CMS, LNT)   0.850028  0.776143
3  (SWKS, QRVO)   0.850799  0.830603
4    (ALK, DAL)   0.874162  0.744590
Now I am trying to split the column named index into two columns on the comma.
The desired output should look like this:
  index1 index2  corr_year   corr_5d
0    DAL    AAL   0.873762  0.778594
1    WEC     ED   0.851578  0.850549
2    CMS    LNT   0.850028  0.776143
3   SWKS   QRVO   0.850799  0.830603
4    ALK    DAL   0.874162  0.744590
I have tried using DataFrame.explode() with the following code:
data_results_test = data_results_combined.explode('index')
data_results_test
Which leads to the following output:
  index  corr_year   corr_5d
0   DAL   0.873762  0.778594
0   AAL   0.873762  0.778594
1   WEC   0.851578  0.850549
1    ED   0.851578  0.850549
How can I achieve the split with newly added columns instead of rows? explode does not seem to have an option to choose whether to add new rows or columns.
How about a simple apply? (This assumes the 'index' column holds tuples.)
data_results_combined['index1'] = data_results_combined['index'].apply(lambda x: x[0])
data_results_combined['index2'] = data_results_combined['index'].apply(lambda x: x[1])
df[['index1','index2']] = df['index'].str.split(',',expand=True)
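Note that the two snippets above assume different dtypes: the apply version needs tuples, while str.split needs plain strings (and, for values like '(DAL, AAL)', would leave the parentheses in place). A hedged, vectorized sketch for the tuple case that builds both columns at once:
import pandas as pd

# assumes the 'index' column holds 2-tuples such as ('DAL', 'AAL')
data_results_combined[['index1', 'index2']] = pd.DataFrame(
    data_results_combined['index'].tolist(),
    index=data_results_combined.index,
)
# optionally drop the original tuple column afterwards:
# data_results_combined = data_results_combined.drop(columns='index')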

How to compare each date in a cell with all the dates in a column

I have a dataframe with three columns lets say
Name   Address  Date
faraz  xyz      2022-01-01
Abdul  abc      2022-06-06
Zara   qrs      2021-02-25
I want to compare each date in the Date column with all the other dates in that column, and only keep the rows whose date lies within 6 months of at least one of the other dates.
For example, 2022-01-01 to 2022-06-06 is about 5 months, so we keep both of those rows; but 2021-02-25 is more than 6 months from both 2022-06-06 and 2022-01-01, so we drop that row.
Desired Output:
Name   Address  Date
faraz  xyz      2022-01-01
Abdul  abc      2022-06-06
I have tried a couple of approaches, such as nested loops, but I have 1 million+ entries and it takes forever to run. Some of the dates repeat, too; not all are unique.
for index, row in dupes_df.iterrows():
    for date in uniq_dates_list:
        format_date = datetime.strptime(date, '%d/%m/%y')
        if ((format_date.year - row['JournalDate'].year) * 12
                + (format_date.month - row['JournalDate'].month) <= 6):
            print("here here")
            break
    else:
        dupes_df.drop(index, inplace=True)
I need a much more optimal solution for it. I have studied lambda functions but couldn't get to the depths of them.
IIUC, this should work for you:
import pandas as pd
import itertools
from io import StringIO
data = StringIO("""Name;Address;Date
faraz;xyz;2022-01-01
Abdul;abc;2022-06-06
Zara;qrs;2021-02-25
""")
df = pd.read_csv(data, sep=';', parse_dates=['Date'])
# every pairwise combination of dates, later date first
df_date = pd.DataFrame(
    [sorted(pair, reverse=True) for pair in itertools.combinations(df['Date'], 2)],
    columns=['Date1', 'Date2'])
# gap in days between the two dates of each pair
df_date['diff'] = (df_date['Date1'] - df_date['Date2']).dt.days
# keep rows whose date belongs to a pair at most ~6 months (180 days) apart
df[df.Date.isin(df_date[df_date['diff'] <= 180].iloc[:, :-1].T[0])]
Output:
    Name Address       Date
0  faraz     xyz 2022-01-01
1  Abdul     abc 2022-06-06
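As an aside, 180 days only approximates 6 calendar months. If an exact calendar check matters, a hedged variant using pd.DateOffset on the pair table built above:
# True where the pair is at most 6 calendar months apart
df_date['within_6m'] = df_date['Date2'] + pd.DateOffset(months=6) >= df_date['Date1']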
First, I think it would be easier to use relativedelta from dateutil.
Reference: https://pynative.com/python-difference-between-two-dates-in-months/
Second, I think you need to add a column; let's call it score.
In the inner loop, if the delta is <= 6 months, set score = 1 and continue. This way each row is compared to all rows.
Then delete all rows that have score == 0, as sketched below.
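A minimal sketch of that score idea, assuming df is the Name/Address/Date frame with Date already parsed as datetimes (it is still O(n²), so it illustrates the logic rather than the speed-up):
from dateutil.relativedelta import relativedelta

def within_six_months(a, b):
    # True if dates a and b are at most 6 calendar months apart
    earlier, later = sorted([a, b])
    return later <= earlier + relativedelta(months=6)

dates = list(df['Date'])
# score = 1 if the row's date is within 6 months of at least one other date
df['score'] = [
    int(any(within_six_months(d, other)
            for j, other in enumerate(dates) if i != j))
    for i, d in enumerate(dates)
]
df = df[df['score'] == 1].drop(columns='score')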

How to count the occurrences of strings that start with a specific substring in comma-separated values in a pandas data frame?

I am new to Python. I am working with a dataframe (360000 rows and 2 columns) that looks something like this:
business_id  date
P01          2019-07-6 , 2018-06-05, 2019-07-06...
P02          2016-03-6 , 2019-04-10
P03          2019-01-02
The date column holds comma-separated dates from the years 2010-2019. I am trying to count, for each business id, how many of its dates fall in each month of 2019. Can anyone please help me? Thanks.
You can do it as follows:
first use str.split to separate the dates in each cell into a list,
then explode to flatten the lists,
convert to datetime with pd.to_datetime and extract the month,
finally use pd.crosstab to pivot/count the months and join.
Altogether:
s = pd.to_datetime(df['date'].str.split(r'\s*,\s*').explode()).dt.to_period('M')
out = pd.crosstab(s.index, s)
# this gives the expected output
df.join(out)
Output (out):
date   2016-03  2018-06  2019-01  2019-04  2019-07
row_0
0            0        1        0        0        2
1            1        0        0        1        0
2            0        0        1        0        0
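Since the question asks about 2019 only, a hedged tweak is to filter the monthly periods before the crosstab:
# keep only periods from 2019 before counting
s2019 = s[s.dt.year == 2019]
out = pd.crosstab(s2019.index, s2019)
df.join(out)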
If they are not datetime objects yet, you may want to start by converting the column (Series) to datetime with pd.to_datetime() (note its format parameter).
Then you can access the datetime attributes through the .dt accessor,
e.g. df[df.COLUMN_NAME.dt.month == 5]

Using regex to create new column in dataframe

I have a dataframe, and from one of its columns I need to pull out specific text and place it into its own column. From the dataframe below I need to take elements of the LAUNCH column and put them into a new column next to it; specifically, I need to extract the date from the rows that provide one, for example 'Mar-24'.
df =
|LAUNCH
0|Step-up Mar-24:x1.5
1|unknown
2|NTV:62.1%
3|Step-up Aug-23:N/A,
I would like the output to be something like this:
df =
|LAUNCH |DATE
0|Step-up Mar-24:x1.5 | Mar-24
1|unknown | nan
2|NTV:62.1% | nan
3|Step-up Aug-23:N/A, | Aug-23
And if this can be done, would it also be possible to display the date as something like 24-03-01 (yyyy-mm-dd) rather than Mar-24?
One way is to use str.extract, looking for a match on any month name:
months = (pd.to_datetime(pd.Series([*range(1, 13)]), format='%m')  # 1..12 so that Dec is included
            .dt.month_name()
            .str[:3]
            .values.tolist())
pat = rf"((?:{'|'.join(months)})-\d+)"
# '((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\\d+)'
df['DATE'] = df.LAUNCH.str.extract(pat)
print(df)
                LAUNCH    DATE
0  Step-up Mar-24:x1.5  Mar-24
1              unknown     NaN
2            NTV:62.1%     NaN
3   Step-up Aug-23:N/A  Aug-23
Use str.extract with a named capturing group.
The code to add a new column with the extraction result can be, e.g.:
df = pd.concat(
    [df, df.LAUNCH.str.extract(
        r'(?P<DATE>(?:Jan|Feb|Ma[ry]|Apr|Ju[nl]|Aug|Sep|Oct|Nov|Dec)-\d{2})')],
    axis=1, sort=False)
The result, for your data, is:
                 LAUNCH    DATE
0   Step-up Mar-24:x1.5  Mar-24
1               unknown     NaN
2             NTV:62.1%     NaN
3   Step-up Aug-23:N/A,  Aug-23
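As for the follow-up about the display format: assuming 'Mar-24' means March 2024, a hedged final step parses the extracted value with %b-%y and reprints it as yyyy-mm-dd (the day defaults to the 1st; NaN rows pass through as NaN):
df['DATE'] = (pd.to_datetime(df['DATE'], format='%b-%y')
                .dt.strftime('%Y-%m-%d'))   # 'Mar-24' -> '2024-03-01'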

Get unique values in column B for each unique record in column A using python/pandas

I'm looking for a quick and efficient way to do the following task.
I need to create a separate column for each DeviceID. The column must contain an array with unique SessionStartDate values for each DeviceID.
For example:
8846620190473426378 | [2018-08-01, 2018-08-02]
381156181455864495 | [2018-08-01]
Though user 8846620190473426378 may have had 30 sessions on 2018-08-01, and 25 sessions on 2018-08-02, I'm only interested in unique dates when these sessions occurred.
Currently, I'm using this approach:
df_main['active_days'] = [
    sorted(set(
        sessions['SessionStartDate'].loc[sessions['DeviceID'] == x['DeviceID']]
    ))
    for _, x in df_main.iterrows()
]
df_main here is another DataFrame, containing aggregated data grouped by DeviceID
This approach is very slow (Wall time: 1h 45min 58s), and I believe there's a better solution for the task.
Thanks in advance!
I believe you need sort_values with SeriesGroupBy.unique:
rng = pd.date_range('2017-04-03', periods=4)
sessions = pd.DataFrame({'SessionStartDate': rng, 'DeviceID': [1, 2, 1, 2]})
print(sessions)
  SessionStartDate  DeviceID
0       2017-04-03         1
1       2017-04-04         2
2       2017-04-05         1
3       2017-04-06         2
# if necessary, convert datetimes to dates
sessions['SessionStartDate'] = sessions['SessionStartDate'].dt.date
out = (sessions.sort_values('SessionStartDate')
               .groupby('DeviceID')['SessionStartDate']
               .unique())
print(out)
DeviceID
1 [2017-04-03, 2017-04-05]
2 [2017-04-04, 2017-04-06]
Name: SessionStartDate, dtype: object
Another solution is to remove duplicates with drop_duplicates, then groupby with conversion to lists:
sessions['SessionStartDate'] = sessions['SessionStartDate'].dt.date
out = (sessions.sort_values('SessionStartDate')
               .drop_duplicates(['DeviceID', 'SessionStartDate'])
               .groupby('DeviceID')['SessionStartDate']
               .apply(list))
print(out)
DeviceID
1 [2017-04-03, 2017-04-05]
2 [2017-04-04, 2017-04-06]
Name: SessionStartDate, dtype: object
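To attach the result back onto df_main, as the question asks (a hedged final step, assuming df_main has a DeviceID column):
# map each device's list of unique dates onto the aggregated frame
df_main['active_days'] = df_main['DeviceID'].map(out)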
