Using regex to create new column in dataframe - python

I have a dataframe and in one of its columns I need to pull out specific text and place it into its own column. From the dataframe below I need to take elements of the LAUNCH column and add them to a new column next to it. Specifically, I need to extract the date from the rows that provide it, for example 'Mar-24'.
df =
  | LAUNCH
0 | Step-up Mar-24:x1.5
1 | unknown
2 | NTV:62.1%
3 | Step-up Aug-23:N/A,
I would like the output to be something like this:
df =
  | LAUNCH              | DATE
0 | Step-up Mar-24:x1.5 | Mar-24
1 | unknown             | nan
2 | NTV:62.1%           | nan
3 | Step-up Aug-23:N/A, | Aug-23
And if this can be done, would it also be possible to display the date as something like 24-03-01 (yyyy-mm-dd) rather than Mar-24?

One way is to use str.extract, looking for a match on any month abbreviation (note range(1, 13), so that December is included):
months = (pd.to_datetime(pd.Series([*range(1, 13)]), format='%m')
          .dt.month_name()
          .str[:3]
          .values.tolist())
pat = rf"((?:{'|'.join(months)})-\d+)"
# '((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\\d+)'
df['DATE'] = df.LAUNCH.str.extract(pat)
print(df)
                LAUNCH    DATE
0  Step-up Mar-24:x1.5  Mar-24
1              unknown     NaN
2            NTV:62.1%     NaN
3  Step-up Aug-23:N/A,  Aug-23
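The question also asks for a yyyy-mm-dd style date. A minimal sketch, assuming the DATE column produced above: '%b-%y' parses the abbreviated month and two-digit year, and the day defaults to the first of the month, so Mar-24 becomes 2024-03-01 (NaN rows simply become NaT and stay missing):
df['DATE'] = pd.to_datetime(df['DATE'], format='%b-%y')
df['DATE'] = df['DATE'].dt.strftime('%Y-%m-%d')  # e.g. '2024-03-01'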

Use str.extract with a named capturing group.
The code to add a new column with the extraction result can be, e.g.:
df = pd.concat([df, df.LAUNCH.str.extract(
    r'(?P<DATE>(?:Jan|Feb|Ma[ry]|Apr|Ju[nl]|Aug|Sep|Oct|Nov|Dec)-\d{2})')],
    axis=1, sort=False)
The result, for your data, is:
                LAUNCH    DATE
0  Step-up Mar-24:x1.5  Mar-24
1              unknown     NaN
2            NTV:62.1%     NaN
3  Step-up Aug-23:N/A,  Aug-23

How to modify code in Python so as to make calculations only on NOT NaN rows in Pandas?

I have a Pandas DataFrame in Python like below:
NR
--------
910517196
921122192
NaN
And by using the code below I try to calculate age based on column NR in the above DataFrame (it does not matter exactly how the code works; I know that it is correct - briefly, I take the first 6 characters to calculate age, because for example 910517 is 1991-05-17):
df["age"] = (ABT_DATE - pd.to_datetime(df.NR.str[:6], format = '%y%m%d')) / np.timedelta64(1, 'Y')
My problem is that I cannot apply the above code directly, because some values in column "NR" are NaN.
My question is: how can I modify my code so that the calculation uses only those rows of column "NR" that are not NaN?
As a result I need something like below: simply disregard the NaN rows and, where there is a NaN in column NR, insert a NaN in the calculated age column as well:
NR        | age
---------------
910517196 | 30
921122192 | 29
NaN       | NaN
How can I do that in Python Pandas?
df['age'] = np.where(df['NR'].notnull(), your_calculation, np.nan)  # your_calculation = your age expression
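A concrete sketch of that template, assuming the question's ABT_DATE and NR column. With errors='coerce', pd.to_datetime passes the NaN rows through as NaT, so they come out as NaN ages:
import numpy as np
import pandas as pd

ABT_DATE = pd.Timestamp('2021-06-01')  # hypothetical reference date
df = pd.DataFrame({'NR': ['910517196', '921122192', np.nan]})

# Parse the first 6 characters; NaN and unparseable values become NaT.
birth = pd.to_datetime(df['NR'].str[:6], format='%y%m%d', errors='coerce')
df['age'] = np.where(df['NR'].notnull(),
                     (ABT_DATE - birth) / np.timedelta64(1, 'Y'),
                     np.nan)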

Filtering, transposing and concatenating with Pandas

I'm trying something I've never done before and I'm in need of some help.
Basically, I need to filter sections of a pandas dataframe, transpose each filtered section and then concatenate the resulting sections together.
Here's a representation of my dataframe:
df:
id | text_field | text_value
1  | Date       | 2021-06-23
1  | Hour       | 10:50
2  | Position   | City
2  | Position   | Countryside
3  | Date       | 2021-06-22
3  | Hour       | 10:45
I can then use some filtering method to isolate parts of my data:
df.groupby('id').filter(lambda x: True)
test = df.query(' id == 1 ')
test = test[["text_field","text_value"]]
test_t = test.set_index("text_field").T
test_t:
text_field | Date | Hour
text_value | 2021-06-23 | 10:50
If I repeat the process looking for rows with id == 3 and then concatenate the result with test_t, I'll have the following:
text_field | Date | Hour
text_value | 2021-06-23 | 10:50
text_value | 2021-06-22 | 10:45
I'm aware that performing this with rows where id == 2 will give me other columns, and that's alright too; it's what I want as well.
What I can't figure out is how to do this for every "id" in my dataframe. I wasn't able to create a function or for loop that works. Can somebody help me?
To summarize:
1 - I need to separate my dataframe into sections according to the values in the "id" column
2 - After that I need to remove the "id" column and transpose the result
3 - I need to concatenate every resulting dataframe into one big dataframe
You can use pivot_table:
df.pivot_table(
    index='id', columns='text_field', values='text_value', aggfunc='first')
Output:
text_field Date Hour Position
id
1 2021-06-23 10:50 NaN
2 NaN NaN City
3 2021-06-22 10:45 NaN
It's not exactly clear how you want to deal with repeating values, though; it would be great to have some description of that (id=2 would make a good example).
Update: If you want to ignore the ids and simply concatenate all the values:
pd.DataFrame(df.groupby('text_field')['text_value'].apply(list).to_dict())
Output:
Date Hour Position
0 2021-06-23 10:50 City
1 2021-06-22 10:45 Countryside
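For completeness, the literal filter-transpose-concatenate loop described in the question can also be written directly. A minimal sketch, assuming the df shown in the question (the inner groupby('text_field').first() resolves the repeated text_field under id == 2):
pieces = []
for _, grp in df.groupby('id'):
    # One row per id: index by text_field, keep the first value for repeats.
    row = grp.groupby('text_field')['text_value'].first()
    pieces.append(row.to_frame().T)

result = pd.concat(pieces, ignore_index=True)  # columns: Date, Hour, Position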

How to fill missing dates with corresponding NaN in other columns

I have a CSV that initially creates the following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates, with a corresponding NaN in the Portfoliovalue column. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column.
However, the bfill replaces all my NaNs, and removing it only returns an error.
So far I have tried this:
from datetime import datetime

import pandas as pd

df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash  # 'cash' is defined earlier in the programme
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code so that it fills my missing dates. However, it is part of a programme which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it the code adds a duplicate date, which then causes the reindexing inside .asfreq() to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex(), or maybe the assignment of today's date needs changing?
Pandas has the asfreq function for a DatetimeIndex; it is basically just a thin but convenient wrapper around reindex() which generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
Pandas has the reindex method: given a list of indices, it conforms the DataFrame to that list, inserting NaN rows for indices that were not present before.
In your case, you can create all the dates you want, with date_range for example, and then pass them to reindex. You might need a simple set_index and reset_index, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
First we set the 'Date' column as the index. Then we reindex with the full list of dates (given by date_range from the minimal to the maximal date in the 'Date' column, with daily frequency) as the new index. This results in NaNs in the places without a former value.
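As for the follow-up "ValueError: cannot reindex from a duplicate axis": one hedged sketch is to collapse duplicate dates, keeping the latest row for each date, before reindexing to daily frequency:
df2['Date'] = pd.to_datetime(df2['Date'])
df2 = (df2.groupby('Date', as_index=False).last()  # keep the latest row per date
          .set_index('Date')
          .asfreq('D')
          .reset_index())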

How to merge different column under one column in Pandas

I have a dataframe which is sparse and looks something like this:
Conti_mV_XSCI_140 | Conti_mV_XSCI_12 | Conti_mV_XSCI_76 | Conti_mV_XSCO_11 | Conti_mV_XSCO_203 | Conti_mV_XSCO_75
1                 | nan              | nan              | 12               | nan               | nan
nan               | 22               | nan              | nan              | 13                | nan
nan               | nan              | 9                | nan              | nan               | 31
As you can see, XSCI is present in 3 header names; the only difference is a random suffix (_140, _12, _76) appended to each, which makes them distinct.
This is not correct. The column names should be just Conti_mV_XSCI and Conti_mV_XSCO,
and each final column (without any random suffix) should hold the values from all the columns it was spread across (for example, XSCI was spread across XSCI_140, XSCI_12 and XSCI_76).
The final dataframe should look something like this:
Conti_mV_XSCI | Conti_mV_XSCO
1             | 12
22            | 13
9             | 31
If you notice, the first value of XSCI comes from XSCI_140, the second value comes from the second column with XSCI, and so on. The same holds for XSCO.
The issue is, I have to do this for all the columns starting with a certain prefix, like "Conti_mV", "IDD_PowerUp_mA", etc.
My issue:
I am having a hard time cleaning up the header names, because as soon as I remove the random number from the end, it throws an error about the columns being duplicates; it is also not elegant.
It would be a great help if anyone can help me. Please comment if anything is not clear here.
I need a new dataframe with one column where there were 3, combining the data from them.
Thanks.
First, if necessary, convert all columns to numeric:
df = df.apply(pd.to_numeric, errors='coerce')
If you need to group by the column names with the numeric suffix split off from the right side, summing the values within each group:
df = df.groupby(lambda x: x.rsplit('_', 1)[0], axis=1).sum()
print(df)
Conti_mV_XSCI Conti_mV_XSCO
0 1.0 12.0
1 22.0 13.0
2 9.0 31.0
If you need to filter the columns manually:
df['Conti_mV_XSCI'] = df.filter(like='XSCI').sum(axis=1)
df['Conti_mV_XSCO'] = df.filter(like='XSCO').sum(axis=1)
EDIT: One idea for summing only the columns whose names start with one of the prefixes in a given list:
cols = ['IOZH_Pat_uA', 'IOZL_Pat_uA', 'Power_Short_uA', 'IDDQ_uA']
for c in cols:
    # ^ anchors the regex to the start of the column name
    columns = df.filter(regex=f'^{c}')
    df[c] = columns.sum(axis=1)
    df = df.drop(columns.columns, axis=1)  # drop the now-redundant suffixed columns
print(df)
Try:
df['Conti_mV_XSCI'] = df.filter(regex='XSCI').sum(axis=1)
df['Conti_mV_XSCO'] = df.filter(regex='XSCO').sum(axis=1)
Note the axis=1: without it, sum() would compute column totals rather than row-wise sums.
Edit:
You can fillna with zeroes before the above operations:
df = df.fillna(0)
This will add a column Conti_mV_XSCI with the first non-NaN entry among the columns whose names begin with Conti_mV_XSCI:
from math import isnan
# For each row, keep the first value that is not NaN.
df['Conti_mV_XSCI'] = df.filter(regex=("Conti_mV_XSCI.*")).apply(lambda row: [_ for _ in row if not isnan(_)][0], axis=1)
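An equivalent, more pandas-idiomatic sketch: back-fill across the filtered columns and take the first column, which holds the first non-NaN value per row (and, unlike the list indexing above, does not raise if a row is all NaN):
df['Conti_mV_XSCI'] = df.filter(regex='Conti_mV_XSCI.*').bfill(axis=1).iloc[:, 0]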
You can use the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub:
# install the latest dev version of pyjanitor
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor

(df.pivot_longer(names_to=".value",
                 names_pattern=r"(.+)_\d+")
   .dropna())
Conti_mV_XSCI Conti_mV_XSCO
0 1.0 12.0
4 22.0 13.0
8 9.0 31.0
The names_pattern captures everything before the trailing _<digits> suffix, so columns sharing a captured name are collapsed under that single header, and dropna removes the rows that were all-NaN padding.

Grouping data to complete records between each other

I have a task where I need to clean my data of duplicate records, but at the same time fill the NaN cells from the values of the other records with the same name. For example:
id        | id2 | name      | other_n | date            | country
1.177.002 | nan | test_name | nan     | 8 decembre 1981 | usa
1.177.002 | A   | test_name | ALVA    | nan             | nan
Until now I have tried a plain groupby, but I don't get the result I expected:
tst.groupby('name').mean()
tst.groupby('name').sum()
The result I'm looking for should look like this:
id        | id2 | name      | other_n | date            | country
1.177.002 | A   | test_name | ALVA    | 8 decembre 1981 | usa
Run:
df.groupby('name', as_index=False)\
  .agg(lambda col: col.loc[col.first_valid_index()])\
  .reindex(df.columns, axis=1)
The final reindex is needed to bring the column order back to how the columns are ordered in the source DataFrame; otherwise name would be moved to the first place.
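A shorter alternative sketch: GroupBy.first() already returns the first non-NaN value per column within each group, so it collapses the duplicate rows the same way (and returns NaN, rather than raising, if a column is all NaN within a group, unlike first_valid_index above):
result = (df.groupby('name', as_index=False).first()
            .reindex(df.columns, axis=1))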
