How to merge different columns under one column in Pandas - Python

I have a dataframe which is sparse and looks something like this:
Conti_mV_XSCI_140|Conti_mV_XSCI_12|Conti_mV_XSCI_76|Conti_mV_XSCO_11|Conti_mV_XSCO_203|Conti_mV_XSCO_75
1 | nan | nan | 12 | nan | nan
nan | 22 | nan | nan | 13 | nan
nan | nan | 9 | nan | nan | 31
As you can see, XSCI is present in 3 header names; the only difference is a random number (_140, _12, _76) appended at the end, which makes them distinct.
This is not correct. The column names should be Conti_mV_XSCI and Conti_mV_XSCO.
The final column (without any random number) should hold the values from all three columns it was spread across (for example, XSCI was spread across XSCI_140, XSCI_12, XSCI_76).
The final dataframe should look something like this -
Conti_mV_XSCI| Conti_mV_XSCO
1 | 12
22 | 13
9 | 31
If you notice, the first value of XSCI comes from XSCI_140, the second value comes from the second column containing XSCI, and so on. The same goes for XSCO.
The issue is, I have to do this for all columns starting with a certain prefix, like "Conti_mV", "IDD_PowerUp_mA", etc.
My issue:
I am having a hard time cleaning up the header names, because as soon as I strip the random number from the end, pandas complains about duplicate columns; it is also not elegant.
It would be a great help if anyone can help me. Please comment if anything is not clear here.
I need a new dataframe with one column (where there were 3) that combines the data from them.
Thanks.

First, if necessary, convert all columns to numeric:
df = df.apply(pd.to_numeric, errors='coerce')
If you need to group by the column names split from the right side, keeping the first part:
df = df.groupby(lambda x: x.rsplit('_', 1)[0], axis=1).sum()
print (df)
Conti_mV_XSCI Conti_mV_XSCO
0 1.0 12.0
1 22.0 13.0
2 9.0 31.0
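In newer pandas versions grouping with axis=1 is deprecated; a minimal sketch of an equivalent, assuming the same frame, is to transpose, group the rows by the column-name prefix, and transpose back:
import pandas as pd

# prefix of each column name, e.g. 'Conti_mV_XSCI_140' -> 'Conti_mV_XSCI'
prefix = df.columns.str.rsplit('_', n=1).str[0]

# group the transposed frame by that prefix, sum, and transpose back
df_merged = df.T.groupby(prefix).sum().T
print(df_merged)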
If you need to filter the columns manually:
df['Conti_mV_XSCI'] = df.filter(like='XSCI').sum(axis=1)
df['Conti_mV_XSCO'] = df.filter(like='XSCO').sum(axis=1)
EDIT: One idea for summing only the columns whose names start with values given in a list:
cols = ['IOZH_Pat_uA', 'IOZL_Pat_uA', 'Power_Short_uA', 'IDDQ_uA']
for c in cols:
    # here ^ anchors the regex to the start of the column name
    columns = df.filter(regex=f'^{c}')
    df[c] = columns.sum(axis=1)
    df = df.drop(columns.columns, axis=1)
print (df)

Try:
df['Conti_mV_XSCI']=df.filter(regex='XSCI').sum(axis=1)
df['Conti_mV_XSCO']=df.filter(regex='XSCO').sum(axis=1)
Edit:
you can fillna with zeroes before the above operations.
df=df.fillna(0)

This will add a column Conti_mV_XSCI with the first non-NaN entry from any column whose name begins with Conti_mV_XSCI:
from math import isnan
df['Conti_mV_XSCI'] = df.filter(regex=("Conti_mV_XSCI.*")).apply(lambda row: [_ for _ in row if not isnan(_)][0], axis=1)
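A hedged pandas-only alternative, assuming the same goal of taking the first non-NaN value across the matching columns, is to back-fill across the row axis and keep the first column:
# back-fill left to right across the matched columns, then keep the first column
df['Conti_mV_XSCI'] = df.filter(regex='Conti_mV_XSCI.*').bfill(axis=1).iloc[:, 0]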

You can use the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub:
# install the latest dev version of pyjanitor
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
(df.pivot_longer(names_to=".value",
                 names_pattern=r"(.+)_\d+")
   .dropna())
Conti_mV_XSCI Conti_mV_XSCO
0 1.0 12.0
4 22.0 13.0
8 9.0 31.0
The names_pattern captures the part of each column name before the trailing _number, and names_to=".value" tells pivot_longer to use that captured part as the new column header, gathering the matching values under it.

Related

How to modify code in Python so as to make calculations only on NOT NaN rows in Pandas?

I have a Pandas DataFrame in Python like below:
NR
--------
910517196
921122192
NaN
Using the code below I try to calculate age based on column NR in the above DataFrame (it does not matter exactly how the code works, I know that it is correct; briefly, I take the first 6 characters to calculate the age, because for example 910517 is 1991-05-17 :)):
df["age"] = (ABT_DATE - pd.to_datetime(df.NR.str[:6], format = '%y%m%d')) / np.timedelta64(1, 'Y')
My problem is: I need to modify the above code so that the age is calculated only from the NOT NaN values in column "NR", since some values are NaN.
My question is: how can I modify my code so that only the rows where "NR" is not NaN are used in the calculation?
As a result I need something like below, so I simply need to disregard the NaN rows and, where there is a NaN in column NR, insert a NaN in the calculated age column as well:
NR age
------------------
910517196 | 30
921122192 | 29
NaN | NaN
How can I do that in Python Pandas ?
df['age']=np.where(df['NR'].notnull(),'your_calculation',np.nan)
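A minimal sketch of how that template can be filled in, assuming ABT_DATE is your reference date and reusing the age formula from the question (provided it works in your pandas version):
import numpy as np
import pandas as pd

ABT_DATE = pd.Timestamp('2021-05-17')  # assumed reference date
df = pd.DataFrame({'NR': ['910517196', '921122192', np.nan]})

# The .str accessor and pd.to_datetime propagate NaN/NaT, so the formula
# itself tolerates missing NR values.
age = (ABT_DATE - pd.to_datetime(df.NR.str[:6], format='%y%m%d')) / np.timedelta64(1, 'Y')

# Keep the computed age only where NR is not null, otherwise write NaN.
df['age'] = np.where(df['NR'].notnull(), age, np.nan)
print(df)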

Pandas, how to calculate delta between one cell and another in different rows

I have the following frame:
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2
123,45,,,
123,,46,,
123,,47,,
123,,48,,
123,,49,,
123,,51,,
124,45,,,
124,,46,,
124,,47,,
124,,48,,
124,,49,,
124,,51,,
I'd like to add a DELTA column that is (EVENT2TIME - EVENT1TIME):
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2, DELTA
123,45,,,,
123,,46,,,1
123,,47,,,2
123,,48,,,3
123,,49,,,4
123,,51,,,6
124,45,,,,
124,,46,,,1
124,,47,,,2
124,,48,,,3
124,,49,,,4
124,,51,,,6
I think the first thing to do is to copy the value from the row where EVENT1TIME is populated into the other instances of that USERID. But I suspect there may be a better way.
I am making some assumptions:
You want to calculate the difference between column EVENT2TIME and the first row of EVENT1TIME.
You want to store the results in DELTA.
You can do this as follows:
import pandas as pd
df = pd.read_csv('abc.txt')
print (df)
df['DELTA'] = df.iloc[:,2] - df.iloc[0,1]
print (df)
The output of this will be:
USERID EVENT1TIME EVENT2TIME MISC1 MISC2 DELTA
0 123 45.0 NaN NaN NaN NaN
1 123 NaN 46.0 NaN NaN 1.0
2 123 NaN 47.0 NaN NaN 2.0
3 123 NaN 48.0 NaN NaN 3.0
4 123 NaN 49.0 NaN NaN 4.0
5 123 NaN 51.0 NaN NaN 6.0
If you know EVENT1TIME is always and only in the first row, just store it as a variable and subtract it.
val = df.EVENT1TIME[0]
df['DELTA'] = df.EVENT2TIME - val
If you have multiple values every so often in EVENT1TIME, use some logic to back or forward fill all the empty rows for EVENT1TIME. This fill is not stored in the final output df.
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.ffill() # forward fill (down) all nan values
# OR
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.bfill() # back fill (up) all nan values
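If each USERID should use only its own EVENT1TIME, a grouped forward fill keeps the fill from leaking across users; a minimal sketch, assuming the frame is sorted so each user's EVENT1TIME row comes first:
import pandas as pd

df = pd.DataFrame({
    'USERID':     [123, 123, 123, 124, 124, 124],
    'EVENT1TIME': [45, None, None, 45, None, None],
    'EVENT2TIME': [None, 46, 47, None, 46, 47],
})

# forward-fill EVENT1TIME only within each user, then subtract
df['DELTA'] = df['EVENT2TIME'] - df.groupby('USERID')['EVENT1TIME'].ffill()
print(df)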
EDIT: Keeping this for continuity despite how hacky it is.
import numpy as np

locations = list(df[~np.isnan(df.EVENT1TIME)].index)
vals = df.EVENT1TIME.loc[locations]            # all EVENT1TIME values
locations.append(df.EVENT1TIME.index[-1] + 1)  # last row index + 1
last_loc = locations[0]
for idx, next_loc in enumerate(locations[1:]):
    temp = df.loc[last_loc:next_loc - 1]
    df.loc[last_loc:next_loc - 1, 'DELTA'] = temp.EVENT2TIME - vals[last_loc]
    last_loc = next_loc

add values in Pandas DataFrame

I want to add values to a dataframe, but I want to write clean code (short and fast). I really want to improve my skill in writing code.
Suppose that we have a DataFrame and 3 values
df=pd.DataFrame({"Name":[],"ID":[],"LastName":[]})
value1="ema"
value2=23123  # a leading-zero literal like 023123 is a SyntaxError in Python 3
value3="Perez"
I can write:
df.append([value1,value2,value3])
but the output is going to create a new column instead, like:
0 | Name | ID | LastName
ema | nan | nan | nan
023123 | nan | nan| nan
Perez | nan | nan | nan
I want the following output, written as cleanly as possible:
Name | ID | LastName
ema | 023123 | Perez
Is there a way to do this without appending one by one? (I want the shortest/fastest code.)
You can convert the values to a dict, then use append:
df.append(dict(zip(['Name', 'ID', 'LastName'],[value1,value2,value3])), ignore_index=True)
Name ID LastName
0 ema 23123.0 Perez
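Note that DataFrame.append was removed in pandas 2.0; a hedged equivalent with pd.concat (or a plain .loc assignment) for the same three values:
import pandas as pd

df = pd.DataFrame({"Name": [], "ID": [], "LastName": []})
value1, value2, value3 = "ema", 23123, "Perez"

# build a one-row frame and concatenate it (replacement for df.append)
row = pd.DataFrame([dict(zip(['Name', 'ID', 'LastName'], [value1, value2, value3]))])
df = pd.concat([df, row], ignore_index=True)

# or simply assign to the next integer position
# df.loc[len(df)] = [value1, value2, value3]
print(df)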
Here is the explanation:
First, put your 3 values into a list
values = [value1, value2, value3]
and create a variable to use as an index marker when looping later
i = 0
Then use the code below:
for column in df.columns:
    df.loc[0, column] = values[i]
    i += 1
column in df.columns iterates over all the column names in the DataFrame,
and df.loc[0, column] = values[i] sets the value at position i into row 0 of that column.
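Putting the pieces together, a minimal runnable sketch of that loop, assuming the df and the three values from the question:
import pandas as pd

df = pd.DataFrame({"Name": [], "ID": [], "LastName": []})
values = ["ema", 23123, "Perez"]   # value1, value2, value3

i = 0
for column in df.columns:
    df.loc[0, column] = values[i]  # write values[i] into row 0 of this column
    i += 1
print(df)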

Using regex to create new column in dataframe

I have a dataframe, and from one of its columns I need to pull out specific text and place it into its own column. From the dataframe below I need to take elements of the LAUNCH column and add them to a new column next to it; specifically, I need to extract the date from the rows that provide it, for example 'Mar-24'.
df =
|LAUNCH
0|Step-up Mar-24:x1.5
1|unknown
2|NTV:62.1%
3|Step-up Aug-23:N/A,
I would like the output to be something like this:
df =
|LAUNCH |DATE
0|Step-up Mar-24:x1.5 | Mar-24
1|unknown | nan
2|NTV:62.1% | nan
3|Step-up Aug-23:N/A, | Aug-23
And if this can be done, would it also be possible to display the date as something like 2024-03-01 (yyyy-mm-dd) rather than Mar-24?
One way is to use str.extract, looking for a match on any month abbreviation:
months = (pd.to_datetime(pd.Series([*range(1, 13)]), format='%m')
            .dt.month_name()
            .str[:3]
            .values.tolist())
pat = rf"((?:{'|'.join(months)})-\d+)"
# '((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\\d+)'
df['DATE'] = df.LAUNCH.str.extract(pat)
print(df)
LAUNCH DATE
0 Step-up Mar-24:x1.5 Mar-24
1 unknown NaN
2 NTV:62.1% NaN
3 Step-up Aug-23:N/A Aug-23
Use str.extract with a named capturing group.
The code to add a new column with the extracting result can be e.g.:
df = pd.concat([df, df.LAUNCH.str.extract(
r'(?P<DATE>(?:Jan|Feb|Ma[ry]|Apr|Ju[nl]|Aug|Sep|Oct|Nov|Dec)-\d{2})')],
axis=1, sort=False)
The result, for your data, is:
LAUNCH DATE
0 Step-up Mar-24:x1.5 Mar-24
1 unknown NaN
2 NTV:62.1% NaN
3 Step-up Aug-23:N/A, Aug-23
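To address the follow-up about showing the date as yyyy-mm-dd: assuming 'Mar-24' means March 2024 and the first day of the month is acceptable, one option is to parse the extracted string with pd.to_datetime and reformat it:
import pandas as pd

df = pd.DataFrame({'LAUNCH': ['Step-up Mar-24:x1.5', 'unknown',
                              'NTV:62.1%', 'Step-up Aug-23:N/A,']})
df['DATE'] = df['LAUNCH'].str.extract(
    r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2})')

# 'Mar-24' parsed as %b-%y (month abbreviation, 2-digit year) -> 2024-03-01;
# NaN rows become NaT and stay missing after formatting.
df['DATE'] = pd.to_datetime(df['DATE'], format='%b-%y').dt.strftime('%Y-%m-%d')
print(df)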

Deleting a row in pandas dataframe based on condition

Scenario: I have a dataframe with some NaNs scattered around. It has multiple columns; the ones of interest are "bid" and "ask".
What I want to do: I want to remove all rows where the bid column value is nan AND the ask column value is nan.
Question: What is the best way to do it?
What I already tried:
ab_df = ab_df[ab_df.bid != 'nan' and ab_df.ask != 'nan']
ab_df = ab_df[ab_df.bid.empty and ab_df.ask.empty]
ab_df = ab_df[ab_df.bid.notnull and ab_df.ask.notnull]
But none of them work.
You need the vectorized logical operators & or | (Python's and and or compare scalars, not pandas Series). To check for NaN values, you can use isnull and notnull:
To remove all rows where the bid column value is nan AND the ask column value is nan, keep the opposite:
ab_df[ab_df.bid.notnull() | ab_df.ask.notnull()]
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "bid": [np.nan, 1, 2, np.nan],
    "ask": [np.nan, np.nan, 2, 1]
})
df[df.bid.notnull() | df.ask.notnull()]
# ask bid
#1 NaN 1.0
#2 2.0 2.0
#3 1.0 NaN
If you need both columns to be non missing:
df[df.bid.notnull() & df.ask.notnull()]
# ask bid
#2 2.0 2.0
Another option using dropna by setting the thresh parameter:
df.dropna(subset=['ask', 'bid'], thresh=1)
# ask bid
#1 NaN 1.0
#2 2.0 2.0
#3 1.0 NaN
df.dropna(subset=['ask', 'bid'], thresh=2)
# ask bid
#2 2.0 2.0
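A hedged equivalent of thresh=1 that reads closer to the stated requirement is how='all' on the same subset, which drops a row only when both columns are NaN:
df.dropna(subset=['ask', 'bid'], how='all')
# ask bid
#1 NaN 1.0
#2 2.0 2.0
#3 1.0 NaN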
ab_df = ab_df.loc[~ab_df.bid.isnull() | ~ab_df.ask.isnull()]
All this time I've been using that because I convinced myself that .notnull() didn't exist. TIL.
ab_df = ab_df.loc[ab_df.bid.notnull() | ab_df.ask.notnull()]
The key is & rather than and, and | rather than or.
I made a mistake earlier using & - that is wrong here because you want rows where either bid isn't null OR ask isn't null; using & would give you only the rows where both are not null.
I think you can use ab_df.dropna() as well, but I'll have to look it up.
EDIT
Oddly, df.dropna() doesn't seem to support dropping based on NAs in a specific column. I would have thought it did.
Based on the other answer I now see it does. It's Friday afternoon, OK?
