How to set values in a pandas column from another column on a condition - python

I am cleaning my dataset and one column (dtype object) has NaN values that I want to replace with the values of another column (also object).
How can I fill those NaN values without overwriting the non-NaN values?
Here is an example of what I would like to do:
in the rows where Area is NaN I want to set the value of Region, so in this particular case the two NaNs become 'Europe' and 'Africa':
Region   Area
USA      NY
Europe   Berlin
Asia     Beijin
Europe   NaN
Africa   NaN
I tried using a for loop, but I guess it is wrong:
Area_type = df['Area']
def Area_type(x):
    for i in Area_type:
        if i == "NaN":
            i = df['Region']
        else:
            pass
    return Area_type
Thanks a lot

You can instruct pandas to change the value of a column in a subset of records that match a condition by using loc:
df.loc[df["Area"].isna(), "Area"] = df["Region"]
If the missing values are actually the string "NaN" rather than real NaN, use this instead:
df.loc[df["Area"] == "NaN", "Area"] = df["Region"]
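If the missing values are true NaN, fillna with a Series is an equally idiomatic alternative; a minimal sketch using the example data above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Region": ["USA", "Europe", "Asia", "Europe", "Africa"],
    "Area": ["NY", "Berlin", "Beijin", np.nan, np.nan],
})

# fillna accepts a Series: each NaN in Area is replaced by the
# Region value from the same row; non-NaN values are left alone
df["Area"] = df["Area"].fillna(df["Region"])
print(df["Area"].tolist())  # ['NY', 'Berlin', 'Beijin', 'Europe', 'Africa']
```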

Related

From a mixed dtype column, extract string from specific column values in python pandas

My dataframe looks like below:
data = {'pred_id': [np.nan, np.nan, 'Pred ID', 258, 265, 595, 658],
        'class': [np.nan, np.nan, np.nan, 'pork', 'sausage', 'chicken', 'pork'],
        'image': ['Weight', 115.37, 'pred',
                  'app_images/03112020/Prediction/222_prediction_resized.jpg',
                  'app_images/03112020/Prediction/333_prediction_resized.jpg',
                  'volume', np.nan]}
df = pd.DataFrame(data)
df
Edited:
I am trying to create a new column 'image_name' with values from the column 'image'. I want to extract a substring from the 'image' values whose string contains 'app_images/', and otherwise keep the value as it is.
I tried the code below and it throws an AttributeError.
Please help me check the dtype and extract the substring from the values that contain 'app_images/', keeping the others unchanged. I don't know how to fix this. Thanks in advance.
images = []
for i in df['image']:
    if i.dtypes == object:
        if i.__contains__('app_images/'):
            new = i.split('_')[1]
            name = new.split('/')[3] + '.jpg'
            images.append(name)
    else:
        images.append(i)
df['image_name'] = images
df
Do not use a loop; use vectorized code with str.extract and a regex.
From your description and code, this seems to be what you expect:
df['image_name'] = (df['image'].str.extract(r'app_images/.*/(\d+)_[^/]+\.jpg',
                                            expand=False) + '.jpg')
output:
pred_id class image image_name
0 NaN NaN Weight NaN
1 NaN NaN 115.37 NaN
2 Pred ID NaN pred NaN
3 258 pork app_images/03112020/Prediction/222_prediction_resized.jpg 222.jpg
4 265 sausage app_images/03112020/Prediction/333_prediction_resized.jpg 333.jpg
5 595 chicken volume NaN
6 658 pork NaN NaN
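Note that str.extract leaves NaN in the rows that do not match, while the question asked to keep the original value in that case; one way to get that behaviour (a sketch on a reduced frame) is to fill the non-matches back in with fillna:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'image': [
    'Weight', 115.37, 'pred',
    'app_images/03112020/Prediction/222_prediction_resized.jpg']})

# extract the numeric id where the pattern matches; non-matching rows
# become NaN and are then filled back with the original value
extracted = df['image'].astype(str).str.extract(
    r'app_images/.*/(\d+)_[^/]+\.jpg', expand=False) + '.jpg'
df['image_name'] = extracted.fillna(df['image'])
print(df['image_name'].tolist())  # ['Weight', 115.37, 'pred', '222.jpg']
```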

How to modify code in Python so as to make calculations only on NOT NaN rows in Pandas?

I have Pandas Data Frame in Python like below:
NR
--------
910517196
921122192
NaN
And with the code below I try to calculate the age based on column NR in the above Data Frame (it does not matter exactly how the code works, I know that it is correct; briefly, I take the first 6 characters to calculate the age, because for example 910517 is 1991-05-17):
df["age"] = (ABT_DATE - pd.to_datetime(df.NR.str[:6], format = '%y%m%d')) / np.timedelta64(1, 'Y')
My problem is that some values in column "NR" are NaN, so I need the calculation to use only the non-NaN values.
My question is: how can I modify my code so that only the rows where "NR" is not NaN enter the calculation?
As a result I need something like below: simply disregard the NaN rows temporarily and, where there is a NaN in column NR, insert a NaN in the calculated age column as well:
NR        | age
----------+-----
910517196 | 30
921122192 | 29
NaN       | NaN
How can I do that in Python Pandas ?
df['age'] = np.where(df['NR'].notnull(), 'your_calculation', np.nan)
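It is also worth noting that the original expression already propagates NaN on its own: .str[:6] keeps NaN as NaN, to_datetime turns it into NaT, and NaT flows through the subtraction. A runnable sketch, assuming a reference date ABT_DATE of 2021-06-01 and approximating years as days/365.25:

```python
import numpy as np
import pandas as pd

ABT_DATE = pd.Timestamp('2021-06-01')  # assumed reference date
df = pd.DataFrame({'NR': ['910517196', '921122192', np.nan]})

# NaN survives the slice, becomes NaT in to_datetime, and ends up
# as NaN in the age column, so no explicit mask is needed
born = pd.to_datetime(df['NR'].str[:6], format='%y%m%d')
df['age'] = ((ABT_DATE - born).dt.days / 365.25).round()
print(df['age'].tolist())  # [30.0, 29.0, nan]
```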

Python Pandas dataframe - add a new column based on index value

I have a Python pandas dataframe that looks like this:
      year_2000  year_1999  year_1998  year_1997  year_1996  year_1995  (MANH, stock name)
MANH     454.47     -71.90        nan        nan        nan        nan  TEST
LH       385.52     180.95     -24.14     -41.67     -68.92     -26.47  TEST
DGX      373.33      68.04       4.01        nan        nan        nan  TEST
SKX      306.56        nan        nan        nan        nan        nan  TEST
where the stock tickers are the index. I want to add the name of each stock as a new column.
I tried adding the stock name column via yearly_best['MANH','stock name'] = 'TEST', but it adds the same name in all rows.
I have a dictionary called ticker_name which maps tickers to names:
{'TWOU': '2U',
'MMM': '3M',
'ABT': 'Abbott Laboratories',
'ABBV': 'AbbVie Inc.',
'ABMD': 'Abiomed',
'ACHC': 'Acadia Healthcare',
'ACN': 'Accenture',
'ATVI': 'Activision Blizzard',
'AYI': 'Acuity Brands',
'ADNT': 'Adient',
thus I would like to get the names from the dict and put them in a column in the dataframe. How can I do that?
As the keys of your dictionary are index values of your DataFrame, you can try:
d = {'TWOU': '2U',
     'MMM': '3M',
     'ABT': 'Abbott Laboratories',
     'ABBV': 'AbbVie Inc.',
     'ABMD': 'Abiomed',
     'ACHC': 'Acadia Healthcare',
     'ACN': 'Accenture',
     'ATVI': 'Activision Blizzard',
     'AYI': 'Acuity Brands',
     'ADNT': 'Adient'}
df['stock name'] = pd.Series(d)
You can try:
# Create a new column "stock_name" with the index values
yearly_best['stock_name'] = yearly_best.index
# Replace the "stock_name" values based on the dictionary
yearly_best['stock_name'] = yearly_best['stock_name'].map(ticker_name)
Note that in this case, the dataframe's indices will remain as they were (stock tickers). If you would like to replace the indices with row numbers, consider using reset_index
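Since the tickers already are the index, Index.map gives the shortest route; a minimal sketch with toy ticker names (the real ticker_name dict would be used in place of the sample one):

```python
import pandas as pd

ticker_name = {'MANH': 'Manhattan Associates', 'LH': 'Labcorp'}  # toy sample
yearly_best = pd.DataFrame({'year_2000': [454.47, 385.52, 373.33]},
                           index=['MANH', 'LH', 'DGX'])

# map the index through the dict; tickers missing from the dict get NaN
yearly_best['stock name'] = yearly_best.index.map(ticker_name)
print(yearly_best['stock name'].tolist())
```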

Boxplot of Multiindex df

I want to do 2 things:
I want to create one boxplot per date/day with all the values for MeanTravelTimeSeconds in that date. The number of MeanTravelTimeSeconds elements varies from date to date (e.g. one day might have a count of 300 values while another, 400).
Also, I want to transform the rows in my multiindex series into columns because I don't want the rows to repeat every time. If it stays like this I'd have tens of millions of unnecessary rows.
Here is the resulting series after using df.stack() on a df indexed by date (date is a datetime object index):
Date
2016-01-02  NumericIndex                         1611664
            OriginMovementID                        4744
            DestinationMovementID                   5084
            MeanTravelTimeSeconds                   1233
            RangeLowerBoundTravelTimeSeconds         756
                                                     ...
2020-03-31  DestinationMovementID                   3594
            MeanTravelTimeSeconds                   1778
            RangeLowerBoundTravelTimeSeconds        1601
            RangeUpperBoundTravelTimeSeconds        1973
            DayOfWeek                            Tuesday
Length: 11281655, dtype: object
When I use seaborn to plot the boxplot I get a bunch of errors after playing with different selections.
If I try to do df.stack().unstack() or df.stack().T I get the following error:
Index contains duplicate entries, cannot reshape
How do I plot the boxplot and how do I turn the rows into columns?
You really do need to make your index unique for the reshaping functions you want to use to work. I suggest a sequential number that resets at every change of the other two key columns.
import datetime as dt
import random
import numpy as np
import pandas as pd

cat = ["NumericIndex", "OriginMovementID", "DestinationMovementID",
       "MeanTravelTimeSeconds", "RangeLowerBoundTravelTimeSeconds"]
df = pd.DataFrame(
    [{"Date": d, "Observation": cat[random.randint(0, len(cat) - 1)],
      "Value": random.randint(1000, 10000)}
     for i in range(random.randint(5, 20))
     for d in pd.date_range(dt.datetime(2016, 1, 2), dt.datetime(2016, 3, 31), freq="14D")])
# starting point....
df = df.sort_values(["Date", "Observation"]).set_index(["Date", "Observation"])

# generate an array that is sequential within each change of key
seq = np.full(df.index.shape, 0)
s = 0
p = ""
for i, v in enumerate(df.index):
    if i == 0 or p != v:
        s = 0
    else:
        s += 1
    seq[i] = s
    p = v
df["SeqNo"] = seq

# add to index - now unstack works as required
dfdd = df.set_index(["SeqNo"], append=True)
dfdd.unstack(0).loc["MeanTravelTimeSeconds"].boxplot()
print(dfdd.unstack(1).head().to_string())
output
Value
Observation DestinationMovementID MeanTravelTimeSeconds NumericIndex OriginMovementID RangeLowerBoundTravelTimeSeconds
Date SeqNo
2016-01-02 0 NaN NaN 2560.0 5324.0 5085.0
1 NaN NaN 1066.0 7372.0 NaN
2016-01-16 0 NaN 6226.0 NaN 7832.0 NaN
1 NaN 1384.0 NaN 8839.0 NaN
2 NaN 7892.0 NaN NaN NaN
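The hand-written sequence loop above can also be expressed with groupby(...).cumcount(), which numbers the repeats of each (Date, Observation) pair directly; a sketch on a small hand-made frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2016-01-02"] * 3 + ["2016-01-16"] * 2,
    "Observation": ["NumericIndex", "NumericIndex", "OriginMovementID",
                    "MeanTravelTimeSeconds", "MeanTravelTimeSeconds"],
    "Value": [2560, 1066, 5324, 6226, 1384],
})

# cumcount restarts at 0 for every distinct (Date, Observation) pair,
# so adding it as a third index level makes the index unique and
# unstack() can reshape without the "duplicate entries" error
df["SeqNo"] = df.groupby(["Date", "Observation"]).cumcount()
wide = df.set_index(["Date", "Observation", "SeqNo"]).unstack(1)
print(df["SeqNo"].tolist())  # [0, 1, 0, 0, 1]
```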

looping into dates and apply function to pandas dataframe

I'm trying to detect the first date on which an event occurs: here in my dataframe, for product A (see pivot table), I have 20 items stored for the first time on 2017-04-03.
So I want to create a new variable called new_var_2017-04-03 that stores the increment. On the other hand, on the next day, 2017-04-04, I don't mind that the item count is now 50 instead of 20; I only want to store the first event.
My attempt gives me several errors, and I would like to know at least whether the logic behind it makes sense and is "pythonic", or whether I'm completely on the wrong track.
raw_data = {'name': ['B', 'A', 'A', 'B'],
            'date': pd.to_datetime(pd.Series(['2017-03-30', '2017-03-31',
                                              '2017-04-03', '2017-04-04'])),
            'age': [10, 20, 50, 30]}
df1 = pd.DataFrame(raw_data, columns=['date', 'name', 'age'])
table = pd.pivot_table(df1, index=['name'], columns=['date'], values=['age'], aggfunc='sum')
table
I'm passing the dates to a list
dates=df1['date'].values.tolist()
I want to loop backward over my list "dates" and create a variable if an event occurs.
Pseudo code (by i-1 I mean the item before i in the list):
def my_fun(x, dates):
    for i in reversed(dates):
        if (x[i] - x[i-1]) > 0:
            x[new_var + i] = x[i] - x[i-1]
        else:
            x[new_var + i] = 0
    return x
print(df.apply(lambda x: my_fun(x, dates), axis=1))
desired output:
raw_data2 = {'new_var': ['new_var_2017-03-30', 'new_var_2017-03-31',
                         'new_var_2017-04-03', 'new_var_2017-04-04'],
             'result_a': [np.nan, 20, np.nan, np.nan],
             'result_b': [10, np.nan, np.nan, np.nan]}
df2 = pd.DataFrame(raw_data2, columns=['new_var', 'result_a', 'result_b'])
df2.T
Let's try this:
df1['age'] = df1.groupby('name')['age'].transform(lambda x: (x==x.min())*x)
df1.pivot_table(index='name', columns='date', values='age').replace(0,np.nan)
date 2017-03-30 2017-03-31 2017-04-03 2017-04-04
name
A NaN 20.0 NaN NaN
B 10.0 NaN NaN NaN
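An alternative that does not rely on the first value also being the group minimum (a sketch, using the raw_data frame from the question) is to keep only the chronologically first row per name before pivoting:

```python
import numpy as np
import pandas as pd

raw_data = {'name': ['B', 'A', 'A', 'B'],
            'date': pd.to_datetime(['2017-03-30', '2017-03-31',
                                    '2017-04-03', '2017-04-04']),
            'age': [10, 20, 50, 30]}
df1 = pd.DataFrame(raw_data)

# sort by date, keep the first row per name, then pivot; dates with
# no first event simply do not appear as columns
first = df1.sort_values('date').groupby('name').head(1)
table = first.pivot_table(index='name', columns='date', values='age')
print(table)
```

If the full set of dates is needed as columns (filled with NaN), reindex the columns with df1['date'].unique() afterwards.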
