Grouping data to complete records from each other - python

I have a task where I need to deduplicate my data, but at the same time fill the NaN cells of each record with values from other records that have the same name. For example:
id         id2  name       other_n  date             country
1.177.002  nan  test_name  nan      8 decembre 1981  usa
1.177.002  A    test_name  ALVA     nan              nan
So far I have tried a plain groupby, but I don't get the result I expected:
tst.groupby('name').mean()
tst.groupby('name').sum()
The result I'm looking for should look like this:
id         id2  name       other_n  date             country
1.177.002  A    test_name  ALVA     8 decembre 1981  usa

Run:
df.groupby('name', as_index=False)\
  .agg(lambda col: col.loc[col.first_valid_index()])\
  .reindex(df.columns, axis=1)
The final reindex is needed to restore the column order of the source DataFrame; otherwise name would be moved to the first position.
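An alternative worth knowing: GroupBy.first returns the first non-NaN value in each column per group, so it collapses the partial records without the lambda. It also tolerates groups where a column is entirely NaN (there first_valid_index() returns None and the .loc lookup above would fail). A minimal sketch on the sample data:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': ['1.177.002', '1.177.002'],
    'id2': [np.nan, 'A'],
    'name': ['test_name', 'test_name'],
    'other_n': [np.nan, 'ALVA'],
    'date': ['8 decembre 1981', np.nan],
    'country': ['usa', np.nan],
})

# first() skips NaN by default, so the two partial rows merge into one
out = df.groupby('name', as_index=False).first().reindex(df.columns, axis=1)
print(out)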


How can I filter values in a dataframe that are not in DD/MM/YYYY format

At the moment I'm trying to write some code that will scan through a dataframe, find any values that are not in valid DD/MM/YYYY format, and export those rows into a separate dataframe. For example:
Incident Ref  User              Priority level  Date raised  Date Resolved
38103         Bruce Banner      Priority 2      07/05/2022   08/05/2022
35210         Thor Odinson      Priority 1      02/05/2022   04/05/2022
10491         Tony Stark        Priority 1      29/04/2022   29/04/2022
48109         Nick Fury         Priority 3      abc          20/05/2022
58391         Natasha Romanoff  Priority 2      31/02/2021   01/03/2022
Within this dataframe, the last two entries are invalid: one because it is in the wrong format, and one because the date is out of range (31 February does not exist). I want the code to filter through the dataframe and split it into two separate dataframes, one with the correct values and one with the erroneous rows, as follows:
Incident Ref  User              Priority level  Date raised  Date Resolved
48109         Nick Fury         Priority 3      abc          20/05/2022
58391         Natasha Romanoff  Priority 2      31/02/2021   01/03/2022
I've tried the following:
df['Date raised'] = pd.to_datetime(df['Date raised'], format='%Y%m%d', errors='coerce')
However, this just overwrites the erroneous entries and doesn't preserve them for use in another dataframe.
Is there a way to do this?
Thanks!
You can chain conditions with | (bitwise OR), testing for missing values after converting to datetimes. Here %d/%m/%Y is used to match the DD/MM/YYYY format:
m1 = pd.to_datetime(df['Date raised'], format='%d/%m/%Y', errors='coerce').isna()
m2 = pd.to_datetime(df['Date Resolved'], format='%d/%m/%Y', errors='coerce').isna()
df = df[m1 | m2]
print(df)
   Incident Ref              User Priority level Date raised Date Resolved
3         48109         Nick Fury     Priority 3         abc    20/05/2022
4         58391  Natasha Romanoff     Priority 2  31/02/2021    01/03/2022
Alternatively, you can create a list of masks and then chain them with np.logical_or.reduce:
import numpy as np

cols = ['Date raised', 'Date Resolved']
masks = [pd.to_datetime(df[x], format='%d/%m/%Y', errors='coerce').isna() for x in cols]
df = df[np.logical_or.reduce(masks)]
Another possible solution is to remove the format parameter, but then e.g. the string 2021 parses as a valid datetime, because it is converted to 2021-01-01:
cols = ['Date raised','Date Resolved']
masks = [pd.to_datetime(df[x], errors='coerce').isna() for x in cols]
df = df[np.logical_or.reduce(masks)]
print(df)
   Incident Ref              User Priority level Date raised Date Resolved
3         48109         Nick Fury     Priority 3         abc    20/05/2022
4         58391  Natasha Romanoff     Priority 2  31/02/2021    01/03/2022
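Since the goal was to split the data rather than only keep the bad rows, the combined mask and its negation give both dataframes at once; a minimal sketch reusing the masks from above:

import numpy as np
import pandas as pd

cols = ['Date raised', 'Date Resolved']
masks = [pd.to_datetime(df[x], format='%d/%m/%Y', errors='coerce').isna() for x in cols]
bad = np.logical_or.reduce(masks)

df_invalid = df[bad]   # at least one date failed to parse as DD/MM/YYYY
df_valid = df[~bad]    # both dates parsed successfully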

How to fill missing dates with corresponding NaN in other columns

I have a CSV that initially creates following dataframe:
         Date  Portfoliovalue
0  2021-05-01         50000.0
1  2021-05-05         52304.0
Using the following script, I would like to fill in the missing dates and have a corresponding NaN in the Portfoliovalue column. So the result would be this:
         Date  Portfoliovalue
0  2021-05-01         50000.0
1  2021-05-02             NaN
2  2021-05-03             NaN
3  2021-05-04             NaN
4  2021-05-05         52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However, the bfill replaces all my NaNs, and removing it only returns an error.
So far I have tried this:
from datetime import datetime

df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash  # 'cash' is defined elsewhere in the programme
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code, so that it fills my missing dates. However, it is part of a programme, which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
from datetime import datetime

df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
# Solution provided by Uts after asking on Stack Overflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it the code adds a duplicate date, which then causes the .reindex() function to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex() or maybe the assignment of today's date needs changing?
Pandas has the asfreq function for a DatetimeIndex; it is basically just a thin but convenient wrapper around reindex() that generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
         Date  Portfoliovalue
0  2021-05-01         50000.0
1  2021-05-02             NaN
2  2021-05-03             NaN
3  2021-05-04             NaN
4  2021-05-05         52304.0
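For the follow-up ValueError: asfreq cannot reindex an index that contains the same date twice, which happens when the script appends today's row on a day that already has one. One way around it, sketched under the assumption that overwriting today's value is acceptable (with portfolio_value computed as in the question), is to assign by date label and drop any leftover duplicates before resampling:

import pandas as pd
from datetime import datetime

df2 = pd.read_csv("Portfoliovalues.csv")
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date')

# assigning by label updates an existing row or appends a new one,
# so no duplicate date is ever created for today
today = pd.Timestamp(datetime.now().date())
df2.loc[today, 'Portfoliovalue'] = portfolio_value

# belt and braces: keep the last row per date, then fill the gaps
df2 = df2[~df2.index.duplicated(keep='last')].sort_index()
df2 = df2.asfreq('D').reset_index()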
Pandas has the reindex method: given a list of indices, it keeps only the indices from that list, inserting NaN rows for indices that were not there before.
In your case, you can create all the dates you want, with date_range for example, and then pass them to reindex. You might need a simple set_index and reset_index, but I assume you don't care much about the original index.
Example:
(df.set_index('Date')
   .reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D'))
   .reset_index())
First we set the 'Date' column as the index. Then we reindex, passing it the full list of dates (given by date_range from the minimal to the maximal date in the 'Date' column, with daily frequency) as the new index. The result has NaN in the places that had no former value.

Using regex to create new column in dataframe

I have a dataframe and in one of its columns I need to pull out specific text and place it into its own column. From the dataframe below I need to take elements of the LAUNCH column and add them into their own column next to it; specifically, I need to extract the date from the rows that provide it, for example 'Mar-24'.
df =
|LAUNCH
0|Step-up Mar-24:x1.5
1|unknown
2|NTV:62.1%
3|Step-up Aug-23:N/A,
I would like the output to be something like this:
df =
|LAUNCH |DATE
0|Step-up Mar-24:x1.5 | Mar-24
1|unknown | nan
2|NTV:62.1% | nan
3|Step-up Aug-23:N/A, | Aug-23
And if this can be done, would it also be possible to display the date as something like 24-03-01 (yyyy-mm-dd) rather than Mar-24?
One way is to use str.extract, looking for a match on any month abbreviation:
months = (pd.to_datetime(pd.Series([*range(1, 13)]), format='%m')
            .dt.month_name()
            .str[:3]
            .values.tolist())
pat = rf"((?:{'|'.join(months)})-\d+)"
# '((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\\d+)'
df['DATE'] = df.LAUNCH.str.extract(pat)
print(df)
                LAUNCH    DATE
0  Step-up Mar-24:x1.5  Mar-24
1              unknown     NaN
2            NTV:62.1%     NaN
3  Step-up Aug-23:N/A,  Aug-23
Use str.extract with a named capturing group.
The code to add a new column with the extracting result can be e.g.:
df = pd.concat([df, df.LAUNCH.str.extract(
    r'(?P<DATE>(?:Jan|Feb|Ma[ry]|Apr|Ju[nl]|Aug|Sep|Oct|Nov|Dec)-\d{2})')],
    axis=1, sort=False)
The result, for your data, is:
                LAUNCH    DATE
0  Step-up Mar-24:x1.5  Mar-24
1              unknown     NaN
2            NTV:62.1%     NaN
3  Step-up Aug-23:N/A,  Aug-23
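As for the follow-up about displaying the date as yyyy-mm-dd: once the DATE column exists, it can be parsed with %b-%y and reformatted. A minimal sketch, assuming Mar-24 means March 2024 (pandas resolves two-digit years by its usual pivot rules, so verify they land in the intended century):

# 'Mar-24' parses to 2024-03-01 (the day defaults to the 1st)
parsed = pd.to_datetime(df['DATE'], format='%b-%y', errors='coerce')
df['DATE'] = parsed.dt.strftime('%Y-%m-%d')  # rows without a date stay NaN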

Mapping a new column to a DataFrame by rows from another DataFrame

I have a Pandas DataFrame stations with index as id:
id  station   lat     lng
1   Boston    45.343  -45.333
2   New York  56.444  -35.690
I have another DataFrame df1 that has the following:
duration  date      station  gender
NaN       20181118  NaN      M
9         20181009  2.0      F
8         20170605  1.0      F
I want to add to df1 so that it looks like the following DataFrame:
duration  date      station   gender  lat     lng
NaN       20181118  NaN       M       nan     nan
9         20181009  New York  F       56.444  -35.690
8         20170605  Boston    F       45.343  -45.333
I tried doing this iteratively, looking up each row of stations by id as shown in the following example, but I have about 2 million rows and it ended up taking a lot of time.
stat_list = []
lng_list = []
lat_list = []
for stat in df1['station']:
    if not np.isnan(stat):
        ref = stations.loc[stat]  # look up the station row by its id label
        stat_list.append(ref.station)
        lng_list.append(ref.lng)
        lat_list.append(ref.lat)
    else:
        stat_list.append(np.nan)
        lng_list.append(np.nan)
        lat_list.append(np.nan)
Is there a faster way to do this?
Looks like this would be best solved with a merge, which should significantly boost performance:
df1.merge(stations, left_on="station", right_index=True, how="left")
This will leave you with two columns, station_x and station_y. If you only want the station column with the string names in it, you can do:
df_merged = df1.merge(stations, left_on="station", right_index=True, how="left", suffixes=("_x", ""))
df_final = df_merged[df_merged.columns.difference(["station_x"])]
(or just rename one of them before you merge)
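A map-based alternative avoids the suffix handling altogether: since stations is indexed by id, each column can be looked up separately with Series.map. A minimal sketch, assuming the float ids in df1['station'] match the integer index of stations by value:

ids = df1['station']  # station ids (2.0, 1.0, NaN, ...)

# Series.map aligns each id against the stations index; NaN ids stay NaN
df1['lat'] = ids.map(stations['lat'])
df1['lng'] = ids.map(stations['lng'])
df1['station'] = ids.map(stations['station'])  # finally swap ids for names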

Sorting month columns in pandas pivot_table

I have a dataset made up of 4 columns: a numerator, a denominator, country, and month. I am pivoting it to get months as columns, country as index, and values as sum(numerator)/sum(denominator). The only problem is that my columns are all out of order. How can I sort the columns so that earlier months appear first? I tried table = table.sort_index(1) with no luck.
table = pd.pivot_table(df, values=['Numerator', 'Denominator'], index='Country',
                       columns=['Month'], aggfunc=np.sum)
table = table['Numerator'] / table['Denominator']
Edit with full example and data:
Data:
Denominator,Numerator,Country,Month
10,4,USA,1-Jan
6,2,USA,1-Jan
10,1,Canada,1-Jan
9,2,Canada,1-Jan
6,4,Canada,1-Feb
4,3,Canada,1-Feb
Code:
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
table = pd.pivot_table(df, values=['Numerator', 'Denominator'], index='Country',
                       columns=['Month'], aggfunc=np.sum)
table = table['Numerator'] / table['Denominator']
print(table)
Output:
Month    1-Feb     1-Jan
Country
Canada     0.7  0.157895
USA        NaN   0.37500
Desired Output:
Month       1-Jan  1-Feb
Country
Canada   0.157895    0.7
USA       0.37500    NaN
Option 1
Impose a sorting order before the pivot
This option works because pivot_table automatically sorts the index and column values it displays. Currently, Month is a string, so the sort is lexicographic. You can change this with a datetime conversion.
df.Month = pd.to_datetime(df.Month, format='%d-%b')
table = pd.pivot_table(
    df,
    values=['Numerator', 'Denominator'],
    index='Country',
    columns=['Month'],
    aggfunc=np.sum
)
table = table['Numerator'] / table['Denominator']
table.columns = table.columns.strftime('%d-%b')
table
          01-Jan  01-Feb
Country
Canada  0.157895     0.7
USA     0.375000     NaN
Option 2
Reorder after pivot
If your data is stored in chronological order, you can just take df.Month.unique() and use it to reindex your result.
table.reindex(columns=df.Month.unique())
Month       1-Jan  1-Feb
Country
Canada   0.157895    0.7
USA      0.375000    NaN
If that isn't the case (and your data isn't chronologically ordered), here's a little workaround using pd.to_datetime + pd.Series.argsort + unique.
u = df.Month.iloc[
    pd.to_datetime(df.Month, format='%d-%b').argsort()
].unique()
table.reindex(columns=u)
Month       1-Jan  1-Feb
Country
Canada   0.157895    0.7
USA      0.375000    NaN
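If you would rather leave the pivot exactly as written in the question and only fix the column order afterwards, the string columns can also be sorted by their parsed datetime value; a small sketch of that idea:

# sort '1-Jan', '1-Feb', ... chronologically instead of lexicographically
ordered = sorted(table.columns, key=lambda m: pd.to_datetime(m, format='%d-%b'))
table = table[ordered]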
