How to set values in a pandas column from another column on a condition - python

I am cleaning my dataset and one column (dtype object) has NaN values that I want to replace with the values of another column (also object).
How can I fill those NaN values without overwriting the non-NaN values?
Here is an example of what I would like to do:
in the rows where Area is NaN I want to set the value of Region, so in this particular case the two NaNs become 'Europe' and 'Africa':
Region   Area
USA      NY
Europe   Berlin
Asia     Beijin
Europe   NaN
Africa   NaN
I tried using a for loop, but I guess it is wrong:
Area_type = df['Area']
def Area_type(x):
    for i in Area_type:
        if i == "NaN":
            i = df['Region']
        else:
            pass
    return Area_type
Thanks a lot

You can instruct pandas to change the value of a column in a subset of records that match a condition by using loc:
df.loc[df["Area"].isna(), "Area"] = df["Region"]
If the missing values are actually the string "NaN" rather than real NaN, use this instead:
df.loc[df["Area"] == "NaN", "Area"] = df["Region"]
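If the missing values are true NaN, fillna with a Series is an equally idiomatic alternative; a minimal sketch using the example data above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Region": ["USA", "Europe", "Asia", "Europe", "Africa"],
    "Area": ["NY", "Berlin", "Beijin", np.nan, np.nan],
})

# fillna accepts a Series: each NaN in Area is replaced by the
# Region value from the same row; non-NaN values are left alone
df["Area"] = df["Area"].fillna(df["Region"])
print(df["Area"].tolist())  # ['NY', 'Berlin', 'Beijin', 'Europe', 'Africa']
```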

Related

From a mixed dtype column, extract string from specific column values in python pandas

My dataframe looks like below:
data = {'pred_id': [np.nan, np.nan, 'Pred ID', 258, 265, 595, 658],
        'class': [np.nan, np.nan, np.nan, 'pork', 'sausage', 'chicken', 'pork'],
        'image': ['Weight', 115.37, 'pred',
                  'app_images/03112020/Prediction/222_prediction_resized.jpg',
                  'app_images/03112020/Prediction/333_prediction_resized.jpg',
                  'volume', np.nan]}
df = pd.DataFrame(data)
df
Edited:
I am trying to create a new column 'image_name' with values from the column 'image'. I want to extract a substring from the 'image' values whose string contains 'app_images/', and otherwise keep the value as it is.
I tried the code below and it throws an AttributeError.
Please help me check the dtype and extract the substring from the values that contain 'app_images/', keeping the others unchanged. I don't know how to fix this. Thanks in advance.
images = []
for i in df['image']:
    if i.dtypes == object:
        if i.__contains__('app_images/'):
            new = i.split('_')[1]
            name = new.split('/')[3] + '.jpg'
            images.append(name)
    else:
        images.append(i)
df['image_name'] = images
df
Do not use a loop; use vectorized code with str.extract and a regex.
From your description and code, this seems to be what you expect:
df['image_name'] = (df['image'].str.extract(r'app_images/.*/(\d+)_[^/]+\.jpg',
                                            expand=False) + '.jpg')
output:
pred_id class image image_name
0 NaN NaN Weight NaN
1 NaN NaN 115.37 NaN
2 Pred ID NaN pred NaN
3 258 pork app_images/03112020/Prediction/222_prediction_resized.jpg 222.jpg
4 265 sausage app_images/03112020/Prediction/333_prediction_resized.jpg 333.jpg
5 595 chicken volume NaN
6 658 pork NaN NaN
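Note that str.extract leaves NaN in the rows that do not match, while the question asked to keep the original value in that case; one way to get that behaviour (a sketch on a reduced frame) is to fill the non-matches back in with fillna:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'image': [
    'Weight', 115.37, 'pred',
    'app_images/03112020/Prediction/222_prediction_resized.jpg']})

# extract the numeric id where the pattern matches; non-matching rows
# become NaN and are then filled back with the original value
extracted = df['image'].astype(str).str.extract(
    r'app_images/.*/(\d+)_[^/]+\.jpg', expand=False) + '.jpg'
df['image_name'] = extracted.fillna(df['image'])
print(df['image_name'].tolist())  # ['Weight', 115.37, 'pred', '222.jpg']
```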

How to modify code in Python so as to make calculations only on NOT NaN rows in Pandas?

I have Pandas Data Frame in Python like below:
NR
--------
910517196
921122192
NaN
And with the code below I try to calculate the age based on column NR in the above Data Frame (it does not matter exactly how the code works, I know that it is correct; briefly, I take the first 6 characters to calculate the age, because for example 910517 is 1991-05-17):
df["age"] = (ABT_DATE - pd.to_datetime(df.NR.str[:6], format = '%y%m%d')) / np.timedelta64(1, 'Y')
My problem is that some values in column "NR" are NaN, so I need the calculation to use only the non-NaN values.
My question is: how can I modify my code so that only the rows where "NR" is not NaN enter the calculation?
As a result I need something like below: simply disregard the NaN rows temporarily and, where there is a NaN in column NR, insert a NaN in the calculated age column as well:
NR        | age
----------+-----
910517196 | 30
921122192 | 29
NaN       | NaN
How can I do that in Python Pandas ?
df['age'] = np.where(df['NR'].notnull(), 'your_calculation', np.nan)
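It is also worth noting that the original expression already propagates NaN on its own: .str[:6] keeps NaN as NaN, to_datetime turns it into NaT, and NaT flows through the subtraction. A runnable sketch, assuming a reference date ABT_DATE of 2021-06-01 and approximating years as days/365.25:

```python
import numpy as np
import pandas as pd

ABT_DATE = pd.Timestamp('2021-06-01')  # assumed reference date
df = pd.DataFrame({'NR': ['910517196', '921122192', np.nan]})

# NaN survives the slice, becomes NaT in to_datetime, and ends up
# as NaN in the age column, so no explicit mask is needed
born = pd.to_datetime(df['NR'].str[:6], format='%y%m%d')
df['age'] = ((ABT_DATE - born).dt.days / 365.25).round()
print(df['age'].tolist())  # [30.0, 29.0, nan]
```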

Python Pandas dataframe - add a new column based on index value

I have a Python pandas dataframe that looks like this:
      year_2000  year_1999  year_1998  year_1997  year_1996  year_1995  (MANH, stock name)
MANH     454.47     -71.90        nan        nan        nan        nan  TEST
LH       385.52     180.95     -24.14     -41.67     -68.92     -26.47  TEST
DGX      373.33      68.04       4.01        nan        nan        nan  TEST
SKX      306.56        nan        nan        nan        nan        nan  TEST
where the stock tickers are the index. I want to add the name of each stock as a new column.
I tried adding the stock name column via yearly_best['MANH','stock name'] = 'TEST', but it adds the same name in all rows.
I have a dictionary called ticker_name which maps tickers to names:
{'TWOU': '2U',
'MMM': '3M',
'ABT': 'Abbott Laboratories',
'ABBV': 'AbbVie Inc.',
'ABMD': 'Abiomed',
'ACHC': 'Acadia Healthcare',
'ACN': 'Accenture',
'ATVI': 'Activision Blizzard',
'AYI': 'Acuity Brands',
'ADNT': 'Adient',
thus I would like to get the names from the dict and put them in a column in the dataframe. How can I do that?
As the keys of your dictionary are index values of your DataFrame, you can try:
d = {'TWOU': '2U',
     'MMM': '3M',
     'ABT': 'Abbott Laboratories',
     'ABBV': 'AbbVie Inc.',
     'ABMD': 'Abiomed',
     'ACHC': 'Acadia Healthcare',
     'ACN': 'Accenture',
     'ATVI': 'Activision Blizzard',
     'AYI': 'Acuity Brands',
     'ADNT': 'Adient'}
df['stock name'] = pd.Series(d)
You can try:
# Create a new column "stock_name" with the index values
yearly_best['stock_name'] = yearly_best.index
# Replace the "stock_name" values based on the dictionary
yearly_best['stock_name'] = yearly_best['stock_name'].map(ticker_name)
Note that in this case, the dataframe's indices will remain as they were (stock tickers). If you would like to replace the indices with row numbers, consider using reset_index
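Since the tickers already are the index, Index.map gives the shortest route; a minimal sketch with toy ticker names (the real ticker_name dict would be used in place of the sample one):

```python
import pandas as pd

ticker_name = {'MANH': 'Manhattan Associates', 'LH': 'Labcorp'}  # toy sample
yearly_best = pd.DataFrame({'year_2000': [454.47, 385.52, 373.33]},
                           index=['MANH', 'LH', 'DGX'])

# map the index through the dict; tickers missing from the dict get NaN
yearly_best['stock name'] = yearly_best.index.map(ticker_name)
print(yearly_best['stock name'].tolist())
```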

Boxplot of Multiindex df

I want to do 2 things:
I want to create one boxplot per date/day with all the values for MeanTravelTimeSeconds in that date. The number of MeanTravelTimeSeconds elements varies from date to date (e.g. one day might have a count of 300 values while another, 400).
Also, I want to transform the rows in my multiindex series into columns because I don't want the rows to repeat every time. If it stays like this I'd have tens of millions of unnecessary rows.
Here is the resulting series after using df.stack() on a df indexed by date (date is a datetime object index):
Date
2016-01-02  NumericIndex                         1611664
            OriginMovementID                        4744
            DestinationMovementID                   5084
            MeanTravelTimeSeconds                   1233
            RangeLowerBoundTravelTimeSeconds         756
                                                     ...
2020-03-31  DestinationMovementID                   3594
            MeanTravelTimeSeconds                   1778
            RangeLowerBoundTravelTimeSeconds        1601
            RangeUpperBoundTravelTimeSeconds        1973
            DayOfWeek                            Tuesday
Length: 11281655, dtype: object
When I use seaborn to plot the boxplot I get a bunch of errors after playing with different selections.
If I try to do df.stack().unstack() or df.stack().T I get the following error:
Index contains duplicate entries, cannot reshape
How do I plot the boxplot and how do I turn the rows into columns?
You really do need to make your index unique for the reshaping functions you want to use to work. I suggest a sequential number that resets at every change of the other two key columns.
import datetime as dt
import random
import numpy as np
import pandas as pd

cat = ["NumericIndex", "OriginMovementID", "DestinationMovementID",
       "MeanTravelTimeSeconds", "RangeLowerBoundTravelTimeSeconds"]
df = pd.DataFrame(
    [{"Date": d, "Observation": cat[random.randint(0, len(cat) - 1)],
      "Value": random.randint(1000, 10000)}
     for i in range(random.randint(5, 20))
     for d in pd.date_range(dt.datetime(2016, 1, 2), dt.datetime(2016, 3, 31), freq="14D")])
# starting point....
df = df.sort_values(["Date", "Observation"]).set_index(["Date", "Observation"])

# generate an array that is sequential within each change of key
seq = np.full(df.index.shape, 0)
s = 0
p = ""
for i, v in enumerate(df.index):
    if i == 0 or p != v:
        s = 0
    else:
        s += 1
    seq[i] = s
    p = v
df["SeqNo"] = seq

# add to index - now unstack works as required
dfdd = df.set_index(["SeqNo"], append=True)
dfdd.unstack(0).loc["MeanTravelTimeSeconds"].boxplot()
print(dfdd.unstack(1).head().to_string())
output
Value
Observation DestinationMovementID MeanTravelTimeSeconds NumericIndex OriginMovementID RangeLowerBoundTravelTimeSeconds
Date SeqNo
2016-01-02 0 NaN NaN 2560.0 5324.0 5085.0
1 NaN NaN 1066.0 7372.0 NaN
2016-01-16 0 NaN 6226.0 NaN 7832.0 NaN
1 NaN 1384.0 NaN 8839.0 NaN
2 NaN 7892.0 NaN NaN NaN
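The hand-written sequence loop above can also be expressed with groupby(...).cumcount(), which numbers the repeats of each (Date, Observation) pair directly; a sketch on a small hand-made frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2016-01-02"] * 3 + ["2016-01-16"] * 2,
    "Observation": ["NumericIndex", "NumericIndex", "OriginMovementID",
                    "MeanTravelTimeSeconds", "MeanTravelTimeSeconds"],
    "Value": [2560, 1066, 5324, 6226, 1384],
})

# cumcount restarts at 0 for every distinct (Date, Observation) pair,
# so adding it as a third index level makes the index unique and
# unstack() can reshape without the "duplicate entries" error
df["SeqNo"] = df.groupby(["Date", "Observation"]).cumcount()
wide = df.set_index(["Date", "Observation", "SeqNo"]).unstack(1)
print(df["SeqNo"].tolist())  # [0, 1, 0, 0, 1]
```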

looping into dates and apply function to pandas dataframe

I'm trying to detect the first date on which an event occurs: here in my dataframe, for product A (see pivot table), I have 20 items stored for the first time on 2017-04-03.
So I want to create a new variable called new_var_2017-04-03 that stores the increment. On the other hand, on the next day, 2017-04-04, I don't mind that the item count is now 50 instead of 20; I only want to store the first event.
My attempt gives me several errors, and I would like to know at least whether the logic behind it makes sense and is "pythonic", or whether I'm completely on the wrong track.
raw_data = {'name': ['B', 'A', 'A', 'B'],
            'date': pd.to_datetime(pd.Series(['2017-03-30', '2017-03-31',
                                              '2017-04-03', '2017-04-04'])),
            'age': [10, 20, 50, 30]}
df1 = pd.DataFrame(raw_data, columns=['date', 'name', 'age'])
table = pd.pivot_table(df1, index=['name'], columns=['date'], values=['age'], aggfunc='sum')
table
I'm passing the dates to a list
dates=df1['date'].values.tolist()
I want to loop backward over my list "dates" and create a variable if an event occurs.
Pseudo code (by i-1 I mean the item before i in the list):
def my_fun(x, dates):
    for i in reversed(dates):
        if (x[i] - x[i-1]) > 0:
            x[new_var + i] = x[i] - x[i-1]
        else:
            x[new_var + i] = 0
    return x
print(df.apply(lambda x: my_fun(x, dates), axis=1))
desired output:
raw_data2 = {'new_var': ['new_var_2017-03-30', 'new_var_2017-03-31',
                         'new_var_2017-04-03', 'new_var_2017-04-04'],
             'result_a': [np.nan, 20, np.nan, np.nan],
             'result_b': [10, np.nan, np.nan, np.nan]}
df2 = pd.DataFrame(raw_data2, columns=['new_var', 'result_a', 'result_b'])
df2.T
Let's try this:
df1['age'] = df1.groupby('name')['age'].transform(lambda x: (x==x.min())*x)
df1.pivot_table(index='name', columns='date', values='age').replace(0,np.nan)
date 2017-03-30 2017-03-31 2017-04-03 2017-04-04
name
A NaN 20.0 NaN NaN
B 10.0 NaN NaN NaN
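An alternative that does not rely on the first value also being the group minimum (a sketch, using the raw_data frame from the question) is to keep only the chronologically first row per name before pivoting:

```python
import numpy as np
import pandas as pd

raw_data = {'name': ['B', 'A', 'A', 'B'],
            'date': pd.to_datetime(['2017-03-30', '2017-03-31',
                                    '2017-04-03', '2017-04-04']),
            'age': [10, 20, 50, 30]}
df1 = pd.DataFrame(raw_data)

# sort by date, keep the first row per name, then pivot; dates with
# no first event simply do not appear as columns
first = df1.sort_values('date').groupby('name').head(1)
table = first.pivot_table(index='name', columns='date', values='age')
print(table)
```

If the full set of dates is needed as columns (filled with NaN), reindex the columns with df1['date'].unique() afterwards.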
