Python Pandas dataframe - add a new column based on index value

I have a Python pandas dataframe that looks like this:
      year_2000  year_1999  year_1998  year_1997  year_1996  year_1995  (MANH, stock name)
MANH     454.47     -71.90        nan        nan        nan        nan                TEST
LH       385.52     180.95     -24.14     -41.67     -68.92     -26.47                TEST
DGX      373.33      68.04       4.01        nan        nan        nan                TEST
SKX      306.56        nan        nan        nan        nan        nan                TEST
where the stock tickers are the index. I want to add the name of each stock as a new column.
I tried adding the stock name column via yearly_best['MANH','stock name']='TEST', but that puts the same name in every row.
I have a dictionary called ticker_name which contains the tickers and the names
Out[65]:
{'TWOU': '2U',
'MMM': '3M',
'ABT': 'Abbott Laboratories',
'ABBV': 'AbbVie Inc.',
'ABMD': 'Abiomed',
'ACHC': 'Acadia Healthcare',
'ACN': 'Accenture',
'ATVI': 'Activision Blizzard',
'AYI': 'Acuity Brands',
'ADNT': 'Adient',
Thus I would like to get the names from the dict and put them in a column in the dataframe. How can I do that?

As the keys of your dictionary are the index values of your DataFrame, you can try:
d = {'TWOU': '2U',
     'MMM': '3M',
     'ABT': 'Abbott Laboratories',
     'ABBV': 'AbbVie Inc.',
     'ABMD': 'Abiomed',
     'ACHC': 'Acadia Healthcare',
     'ACN': 'Accenture',
     'ATVI': 'Activision Blizzard',
     'AYI': 'Acuity Brands',
     'ADNT': 'Adient'}
df['stock name'] = pd.Series(d)

You can try:
# Create a new column "stock_name" with the index values
yearly_best['stock_name'] = yearly_best.index
# Replace the "stock_name" values based on the dictionary
# (Series.map has no inplace parameter, so assign the result back)
yearly_best['stock_name'] = yearly_best['stock_name'].map(ticker_name)
Note that in this case, the dataframe's indices will remain as they were (stock tickers). If you would like to replace the indices with row numbers, consider using reset_index
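Both answers above can be condensed into one line with Index.map. A minimal runnable sketch (the company names here are made up for illustration; the real ticker_name dict is in the question):

```python
import pandas as pd

# Toy stand-in for yearly_best: tickers as the index
yearly_best = pd.DataFrame({'year_2000': [454.47, 385.52]},
                           index=['MANH', 'LH'])
# Hypothetical subset of the ticker_name dictionary
ticker_name = {'MANH': 'Manhattan Associates', 'LH': 'Labcorp'}

# Map each index value through the dictionary to build the new column
yearly_best['stock name'] = yearly_best.index.map(ticker_name)
print(yearly_best)
```

Tickers missing from the dictionary come out as NaN, which makes gaps easy to spot.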

Related

How to give value to pandas column from other column on condition

I am cleaning my dataset and one column (dtype object) has NaN values that I want to replace with the values given by another column (also object).
How can I transform those NaN without overwriting the non-NaN values?
Here is an example of what I would like to do:
in the rows where Area is NaN I want to set the value of Region, in this particular case 'Europe' and 'Africa':
Region   Area
USA      NY
Europe   Berlin
Asia     Beijin
Europe   NaN
Africa   NaN
I tried using a for loop, but I guess it is wrong:
Area_type = df['Area']
def Area_type(x):
    for i in Area_type:
        if i == "NaN":
            i = df['Region']
        else:
            pass
    return Area_type
Thanks a lot
You can instruct pandas to change the value of a column in a subset of records that match a condition by using loc:
df.loc[df["Area"].isna(), "Area"] = df["Region"]
If NaN values are strings, use this:
df.loc[df["Area"] == "NaN", "Area"] = df["Region"]
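As a sanity check, here is the loc approach run on the example table from the question (assuming the NaN values are real missing values, not the string "NaN"):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Region': ['USA', 'Europe', 'Asia', 'Europe', 'Africa'],
                   'Area':   ['NY', 'Berlin', 'Beijin', np.nan, np.nan]})

# Fill missing Area values from the Region column, leaving non-NaN rows untouched
df.loc[df['Area'].isna(), 'Area'] = df['Region']
print(df)
```

The same result can be had with df['Area'] = df['Area'].fillna(df['Region']), which reads more directly as "fill the gaps from Region".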

From a mixed dtype column, extract string from specific column values in python pandas

My dataframe looks like below:
data = {'pred_id': [np.nan, np.nan, 'Pred ID', 258, 265, 595, 658],
        'class': [np.nan, np.nan, np.nan, 'pork', 'sausage', 'chicken', 'pork'],
        'image': ['Weight', 115.37, 'pred',
                  'app_images/03112020/Prediction/222_prediction_resized.jpg',
                  'app_images/03112020/Prediction/333_prediction_resized.jpg',
                  'volume', np.nan]}
df = pd.DataFrame(data)
df
Edited:
I am trying to create a new column 'image_name' with values from column 'image'. I want to extract a substring from the 'image' values that contain 'app_images/', and otherwise keep the value the same.
I tried the code below and it throws an AttributeError.
Help me with how to find the dtype and then extract a substring from values that contain 'app_images/', keeping the other values as they are. I don't know how to fix this. Thanks in advance.
images = []
for i in df['image']:
    if i.dtypes == object:
        if i.__contains__('app_images/'):
            new = i.split('_')[1]
            name = new.split('/')[3] + '.jpg'
            images.append(name)
        else:
            images.append(i)
df['image_name'] = images
df
Do not use a loop; use vectorized code with str.extract and a regex.
From your description and code, this seems to be what you expect:
df['image_name'] = (df['image'].str.extract(r'app_images/.*/(\d+)_[^/]+\.jpg',
                                            expand=False)
                    + '.jpg')
output:
  pred_id    class                                                      image image_name
0     NaN      NaN                                                     Weight        NaN
1     NaN      NaN                                                     115.37        NaN
2 Pred ID      NaN                                                       pred        NaN
3     258     pork  app_images/03112020/Prediction/222_prediction_resized.jpg    222.jpg
4     265  sausage  app_images/03112020/Prediction/333_prediction_resized.jpg    333.jpg
5     595  chicken                                                     volume        NaN
6     658     pork                                                        NaN        NaN
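The asker also wanted to keep the original value when there is no 'app_images/' match, whereas str.extract leaves NaN there. One way to get that behavior (a sketch, not part of the original answer) is to fill the unmatched rows back from the source column:

```python
import pandas as pd
import numpy as np

# Reduced version of the question's 'image' column
df = pd.DataFrame({'image': ['Weight',
                             'app_images/03112020/Prediction/222_prediction_resized.jpg',
                             np.nan]})

# Extract the numeric id from matching paths; non-matches become NaN
extracted = df['image'].str.extract(r'app_images/.*/(\d+)_[^/]+\.jpg',
                                    expand=False) + '.jpg'
# Where nothing was extracted, fall back to the original value
df['image_name'] = extracted.fillna(df['image'])
print(df)
```

Rows that were NaN to begin with stay NaN, since fillna fills them with the (also missing) source value.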

Using regex to create new column in dataframe

I have a dataframe, and from one of its columns I need to pull out specific text and place it into its own column. From the dataframe below I need to take elements of the LAUNCH column and add them to a new column next to it; specifically, I need to extract the date in the rows that provide it, for example 'Mar-24'.
df =
|LAUNCH
0|Step-up Mar-24:x1.5
1|unknown
2|NTV:62.1%
3|Step-up Aug-23:N/A,
I would like the output to be something like this:
df =
|LAUNCH |DATE
0|Step-up Mar-24:x1.5 | Mar-24
1|unknown | nan
2|NTV:62.1% | nan
3|Step-up Aug-23:N/A, | Aug-23
And if this can be done, would it also be possible to display the date as something like 24-03-01 (yyyy-mm-dd) rather than Mar-24.
One way is to use str.extract, looking for any match on a month of the year (note range(1, 13), so that December is included):
months = (pd.to_datetime(pd.Series([*range(1, 13)]), format='%m')
          .dt.month_name()
          .str[:3]
          .values.tolist())
pat = rf"((?:{'|'.join(months)})-\d+)"
# '((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\\d+)'
df['DATE'] = df.LAUNCH.str.extract(pat)
print(df)
                LAUNCH    DATE
0  Step-up Mar-24:x1.5  Mar-24
1              unknown     NaN
2            NTV:62.1%     NaN
3   Step-up Aug-23:N/A  Aug-23
Use str.extract with a named capturing group.
The code to add a new column with the extraction result can be e.g.:
df = pd.concat([df, df.LAUNCH.str.extract(
r'(?P<DATE>(?:Jan|Feb|Ma[ry]|Apr|Ju[nl]|Aug|Sep|Oct|Nov|Dec)-\d{2})')],
axis=1, sort=False)
The result, for your data, is:
                LAUNCH    DATE
0  Step-up Mar-24:x1.5  Mar-24
1              unknown     NaN
2            NTV:62.1%     NaN
3  Step-up Aug-23:N/A,  Aug-23
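For the follow-up question (displaying Mar-24 as something like 2024-03-01), one option is to parse the extracted strings with an explicit format and reformat them. This is a sketch assuming the two-digit part is a year, as the asker's 24-03-01 example suggests:

```python
import pandas as pd

# Extracted DATE values; None stands for rows with no match
dates = pd.Series(['Mar-24', None, 'Aug-23'])

# Parse month-year strings ('%b-%y'), then render as yyyy-mm-dd
parsed = pd.to_datetime(dates, format='%b-%y')
formatted = parsed.dt.strftime('%Y-%m-%d')
print(formatted)
```

Unmatched rows become NaT on parsing and stay missing after strftime, so the NaN rows in the DATE column are preserved.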

looping into dates and apply function to pandas dataframe

I'm trying to detect the first date when an event occurs: here in my dataframe, for product A (see pivot table) I have 20 items stored for the first time on 2017-04-03.
So I want to create a new variable called new_var_2017-04-03 that stores the increment. On the next day, 2017-04-04, I don't mind that the count is now 50 instead of 20; I only want to store the first event.
It gives me several errors. I would like to know at least whether the logic behind it makes sense and is "pythonic", or whether I'm completely on the wrong track.
raw_data = {'name': ['B', 'A', 'A', 'B'],
            'date': pd.to_datetime(pd.Series(['2017-03-30', '2017-03-31',
                                              '2017-04-03', '2017-04-04'])),
            'age': [10, 20, 50, 30]}
df1 = pd.DataFrame(raw_data, columns=['date', 'name', 'age'])
table = pd.pivot_table(df1, index=['name'], columns=['date'], values=['age'],
                       aggfunc='sum')
table
I'm passing the dates to a list
dates=df1['date'].values.tolist()
I want to loop backward over my list "dates" and create a variable if an event occurs.
Pseudo code (by i-1 I mean the item before i in the list):
def my_fun(x, lst):
    for i in reversed(lst):
        if (x[i] - x[i-1]) > 0:
            x[new_var + i] = x[i] - x[i-1]
        else:
            x[new_var + i] = 0
    return x

print(df.apply(lambda x: my_fun(x, dates), axis=1))
desired output:
raw_data2 = {'new_var': ['new_var_2017-03-30', 'new_var_2017-03-31',
                         'new_var_2017-04-03', 'new_var_2017-04-04'],
             'result_a': [np.nan, 20, np.nan, np.nan],
             'result_b': [10, np.nan, np.nan, np.nan]}
df2 = pd.DataFrame(raw_data2, columns=['new_var', 'result_a', 'result_b'])
df2.T
Let's try this:
df1['age'] = df1.groupby('name')['age'].transform(lambda x: (x==x.min())*x)
df1.pivot_table(index='name', columns='date', values='age').replace(0,np.nan)
date 2017-03-30 2017-03-31 2017-04-03 2017-04-04
name
A NaN 20.0 NaN NaN
B 10.0 NaN NaN NaN
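A different way to read "only the first event" is to sort by date and keep each name's earliest row, then pivot. This sketch keeps the earliest record per name rather than the smallest value, which can differ from the accepted answer when a later value happens to be the minimum:

```python
import pandas as pd

raw_data = {'name': ['B', 'A', 'A', 'B'],
            'date': pd.to_datetime(['2017-03-30', '2017-03-31',
                                    '2017-04-03', '2017-04-04']),
            'age': [10, 20, 50, 30]}
df1 = pd.DataFrame(raw_data)

# Keep only the earliest row per name, then pivot dates into columns
first = df1.sort_values('date').drop_duplicates('name', keep='first')
table = first.pivot_table(index='name', columns='date', values='age')
print(table)
```

On this sample the result matches the accepted answer: A has 20 on 2017-03-31, B has 10 on 2017-03-30, and all other cells are NaN.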

Extract series objects from Pandas DataFrame

I have a dataframe with the columns
['CPL4', 'Part Number', 'Calendar Year/Month', 'Sales', 'Inventory']
'Calendar Year/Month' is unique within each 'Part Number'.
I want to convert each part number to a univariate Series with 'Calendar Year/Month' as the index and either 'Sales' or 'Inventory' as the value.
How can I accomplish this using pandas built-in functions and not iterating through the dataframe manually?
In pandas this is called a MultiIndex. Try setting one on the existing dataframe:
import pandas as pd
df = df.set_index(['Part Number', 'Calendar Year/Month'])
Then df['Sales'] (or df['Inventory']) is a Series keyed by part number and month.
you can use the groupby method, such as:
grouped_df = df.groupby('Part Number')
and then you can access the df of a certain part number and set the index easily, such as:
new_df = grouped_df.get_group('THEPARTNUMBERYOUWANT').set_index('Calendar Year/Month')
if you only want the 2 columns you can do:
print(new_df[['Sales', 'Inventory']])
From the answers and comments here, along with a little more research, I ended with the following solution.
temp_series = df[df["Part Number"] == sku].pivot(columns="Calendar Year/Month", values="Sales").iloc[0]
Where sku is a specific part number from df["Part Number"].unique()
This will give you a univariate time series(temp_series) indexed by "Calendar Year/Month" with values of "Sales" EG:
1.2015 NaN
1.2016 NaN
2.2015 NaN
2.2016 NaN
3.2015 NaN
3.2016 NaN
4.2015 NaN
4.2016 NaN
5.2015 NaN
5.2016 NaN
6.2015 NaN
6.2016 NaN
7.2015 NaN
7.2016 NaN
8.2015 NaN
8.2016 NaN
9.2015 NaN
10.2015 NaN
11.2015 NaN
12.2015 NaN
Name: 161, dtype: float64
<class 'pandas.core.series.Series'>
from the columns
['CPL4', 'Part Number', 'Calendar Year/Month', 'Sales', 'Inventory']
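The MultiIndex idea from the first answer can be sketched end to end on made-up data (the part numbers and sales figures below are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Part Number': ['P1', 'P1', 'P2'],
                   'Calendar Year/Month': ['1.2015', '2.2015', '1.2015'],
                   'Sales': [100, 120, 80]})

# MultiIndex on (part, month); selecting one part gives a univariate series
indexed = df.set_index(['Part Number', 'Calendar Year/Month'])['Sales']
temp_series = indexed.loc['P1']
print(temp_series)
```

temp_series is then a Series indexed by 'Calendar Year/Month' with 'Sales' values, with no manual iteration and no per-sku pivot.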
