Why is pandas str.replace returning NaN? - python

I am trying to remove the comma separator from values in a Pandas dataframe so that I can convert them to integers. I have been using the following method:
df_orders['qty'] = df_orders['qty'].str.replace(',','')
However, this seems to return NaN for some numbers that did not originally contain ',' in their values. I have included a sample of my input data and current output below:
Input:
date sku qty
556603 2020-10-25 A 6
590904 2020-10-21 A 5
595307 2020-10-20 A 31
602678 2020-10-19 A 11
615022 2020-10-18 A 2
641077 2020-10-16 A 1
650203 2020-10-15 A 3
655363 2020-10-14 A 18
667919 2020-10-13 A 5
674990 2020-10-12 A 2
703901 2020-10-09 A 1
715411 2020-10-08 A 1
721557 2020-10-07 A 31
740515 2020-10-06 A 49
752670 2020-10-05 A 4
808426 2020-09-28 A 2
848057 2020-09-23 A 1
865751 2020-09-21 A 2
886630 2020-09-18 A 3
901095 2020-09-16 A 47
938648 2020-09-10 A 2
969909 2020-09-07 A 3
1021548 2020-08-31 A 2
1032254 2020-08-30 A 8
1077443 2020-08-25 A 5
1089670 2020-08-24 A 24
1098843 2020-08-23 A 16
1102025 2020-08-22 A 23
1179347 2020-08-12 A 1
1305700 2020-07-29 A 1
1316343 2020-07-28 A 1
1399930 2020-07-19 A 1
1451864 2020-07-15 A 1
1463195 2020-07-14 A 15
2129080 2020-05-19 A 1
2143468 2020-05-18 A 1
Current Output:
date sku qty
556603 2020-10-25 A 6
590904 2020-10-21 A 5
595307 2020-10-20 A 31
602678 2020-10-19 A 11
615022 2020-10-18 A 2
641077 2020-10-16 A 1
650203 2020-10-15 A 3
655363 2020-10-14 A NaN
667919 2020-10-13 A NaN
674990 2020-10-12 A NaN
703901 2020-10-09 A NaN
715411 2020-10-08 A NaN
721557 2020-10-07 A NaN
740515 2020-10-06 A NaN
752670 2020-10-05 A NaN
808426 2020-09-28 A 2
848057 2020-09-23 A 1
865751 2020-09-21 A 2
886630 2020-09-18 A 3
901095 2020-09-16 A 47
938648 2020-09-10 A NaN
969909 2020-09-07 A NaN
1021548 2020-08-31 A NaN
1032254 2020-08-30 A NaN
1077443 2020-08-25 A NaN
1089670 2020-08-24 A NaN
1098843 2020-08-23 A NaN
1102025 2020-08-22 A NaN
1179347 2020-08-12 A NaN
1305700 2020-07-29 A NaN
1316343 2020-07-28 A 1
1399930 2020-07-19 A 1
1451864 2020-07-15 A 1
1463195 2020-07-14 A 15
2129080 2020-05-19 A 1
2143468 2020-05-18 A 1
I have had a look around but can't seem to find what is causing this error.

I was able to reproduce your issue:
# toy df
df
qty
0 1
1 2,
2 3
df['qty'].str.replace(',', '')
0 NaN
1 2
2 NaN
Name: qty, dtype: object
I created df by doing this:
df = pd.DataFrame({'qty': [1, '2,', 3]})
In other words, your column has mixed data types: some values are integers while others are strings. When you apply .str methods to a column of mixed types, the non-string values are converted to NaN to indicate "hey, it doesn't make sense to run a str method on an int".
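To check whether a column really mixes types, one option is to count the Python type of each value. This is only a minimal sketch on the toy frame from the reproduction; the name type_counts is made up for illustration.

```python
import pandas as pd

# Toy column mixing integers and strings, as in the reproduction.
df = pd.DataFrame({'qty': [1, '2,', 3]})

# Count how many values of each Python type the column holds.
type_counts = df['qty'].map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# The .str accessor yields NaN wherever the value is not a string.
replaced = df['qty'].str.replace(',', '')
print(replaced.isna().tolist())
```

If more than one type shows up in the counts, the NaN values from .str methods are expected.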
You may fix this by converting the entire column to string, then back to int:
df['qty'].astype(str).str.replace(',', '').astype(int)
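Or if you want something a little more robust, try
df['qty'] = pd.to_numeric(
    df['qty'].astype(str).str.extract(r'(\d+)', expand=False), errors='coerce')
Note that this keeps only the first run of digits in each value, so a string such as '1,234' would become 1; if the commas are thousands separators, remove them before converting.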
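For values where the comma is a thousands separator, a possible variant is to strip the commas and let pd.to_numeric do the parsing. The sample values below are hypothetical and only meant to show the idea.

```python
import pandas as pd

# Hypothetical sample: integers mixed with comma-formatted strings.
df = pd.DataFrame({'qty': [6, '1,234', 31, '2,']})

# Cast everything to str first, strip the commas literally, then parse.
cleaned = df['qty'].astype(str).str.replace(',', '', regex=False)

# errors='coerce' turns anything that still is not a number into NaN.
df['qty'] = pd.to_numeric(cleaned, errors='coerce')
print(df['qty'].tolist())
```

Passing regex=False treats the comma as a plain character rather than a pattern.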

Related

Python to increment date every week in a dataframe

I am working on a requirement where I need to increment the date in weeks. Here is my code:
import datetime
import pandas as pd
import numpy as np
c=15
s={'week':[1,2,3,4,5,6,7,8],'Sales':[10,20,30,40,50,60,70,80]}
p=pd.DataFrame(data=s)
p['week'] =p['week'].apply(
    lambda x: datetime.datetime.strptime(f'2021-{x:02}-1', '%Y-%U-%u')
)
O/P-
How can I continue from the last row of the week column to get the next 15 weeks?
Basically, the desired week output starts from 2022-03-01 and continues for the next 14 weeks.
One option is to use date_range to generate additional dates, then use set_index + reindex to append them:
p = p.set_index('week').reindex(pd.date_range('2021-01-04', periods=8+14, freq='W-MON')).rename_axis(['week']).reset_index()
Output:
week Sales
0 2021-01-04 10.0
1 2021-01-11 20.0
2 2021-01-18 30.0
3 2021-01-25 40.0
4 2021-02-01 50.0
5 2021-02-08 60.0
6 2021-02-15 70.0
7 2021-02-22 80.0
8 2021-03-01 NaN
9 2021-03-08 NaN
10 2021-03-15 NaN
11 2021-03-22 NaN
12 2021-03-29 NaN
13 2021-04-05 NaN
14 2021-04-12 NaN
15 2021-04-19 NaN
16 2021-04-26 NaN
17 2021-05-03 NaN
18 2021-05-10 NaN
19 2021-05-17 NaN
20 2021-05-24 NaN
21 2021-05-31 NaN
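The same reindex idea can be written without hard-coding the start date or the row count. This is a sketch under the assumption that the week column already holds consecutive Monday dates; extra_weeks and full_range are names made up for illustration.

```python
import pandas as pd

# Rebuild the question's frame; the week column holds Monday dates.
p = pd.DataFrame({
    'week': pd.date_range('2021-01-04', periods=8, freq='W-MON'),
    'Sales': [10, 20, 30, 40, 50, 60, 70, 80],
})

extra_weeks = 14  # hypothetical name: number of future weeks to append

# Full weekly range: the existing rows plus the extra ones.
full_range = pd.date_range(
    p['week'].min(), periods=len(p) + extra_weeks, freq='W-MON'
)

# Reindexing on the full range leaves Sales as NaN for the new dates.
p = (
    p.set_index('week')
     .reindex(full_range)
     .rename_axis('week')
     .reset_index()
)
print(p.tail(3))
```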
You can set the length of the week list with the range() function and your variable c, but you will also need to pad sales, which has to have the same number of elements:
import pandas as pd
import numpy as np
import datetime
c=15
weeks = list(range(1, c+1))
sales = [10,20,30,40,50,60,70,80]
s={'week':weeks,'Sales':sales+[None]*max(len(weeks)-len(sales), 0)}
p=pd.DataFrame(data=s)
p['week'] =p['week'].apply(
lambda x: datetime.datetime.strptime(f'2021-{x:02}-1', '%Y-%U-%u')
)
print(p)
Another option is DateOffset:
p = pd.concat([p, pd.DataFrame({'week': [p.iloc[-1,0]+pd.DateOffset(weeks=i) for i in range(1,c)]})], ignore_index=True)
print(p)
Output:
week Sales
0 2021-01-04 10.0
1 2021-01-11 20.0
2 2021-01-18 30.0
3 2021-01-25 40.0
4 2021-02-01 50.0
5 2021-02-08 60.0
6 2021-02-15 70.0
7 2021-02-22 80.0
8 2021-03-01 NaN
9 2021-03-08 NaN
10 2021-03-15 NaN
11 2021-03-22 NaN
12 2021-03-29 NaN
13 2021-04-05 NaN
14 2021-04-12 NaN
15 2021-04-19 NaN
16 2021-04-26 NaN
17 2021-05-03 NaN
18 2021-05-10 NaN
19 2021-05-17 NaN
20 2021-05-24 NaN
21 2021-05-31 NaN

idxmax() returns index instead of timestamp

Python beginner here.
I am trying to get the highest price of a particular stock per month, and the date on which that maximum occurred.
Getting the maximum value per month works fine using max(),
but when I try to get the corresponding dates of the max price using idxmax(), my code returns the corresponding index instead of the date. My code looks like this:
Max_Date = Daily_High.groupby(pd.Grouper(key="Date", freq="M")).High.idxmax()
Output
Date High
0 2020-04-30 9929
1 2020-05-31 9946
2 2020-06-30 9966
3 2020-07-31 9993
4 2020-08-31 10014
5 2020-09-30 10016
6 2020-10-31 10044
7 2020-11-30 10063
8 2020-12-31 10097
9 2021-01-31 10114
10 2021-02-28 10125
11 2021-03-31 10139
12 2021-04-30 10180
13 2021-05-31 10182
The output should look like this:
Date High Max Date
0 2020-04-30 2020-04-30
1 2020-05-31 2020-05-26
2 2020-06-30 2020-06-23
3 2020-07-31 2020-07-31
4 2020-08-31 2020-08-31
5 2020-09-30 2020-09-02
6 2020-10-31 2020-10-13
7 2020-11-30 2020-11-09
8 2020-12-31 2020-12-29
9 2021-01-31 2021-01-25
10 2021-02-28 2021-02-09
11 2021-03-31 2021-03-02
12 2021-04-30 2021-04-29
13 2021-05-31 2021-05-03
Hope you can help me to get the correct date. Thank you!
Set the Date column as a DatetimeIndex and remove key="Date" from pd.Grouper, so that idxmax() returns index labels that are dates:
Max_Date = Daily_High.set_index('Date').groupby(pd.Grouper(freq="M")).High.idxmax()
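Here is a small self-contained sketch of the same idea on made-up data. It groups by monthly period instead of pd.Grouper, which is a substitution on my part (newer pandas releases prefer "ME" over "M" as the month-end alias, and periods sidestep that). The values and the names high, month and summary are hypothetical.

```python
import pandas as pd

# Hypothetical daily highs spanning two months.
Daily_High = pd.DataFrame({
    'Date': pd.to_datetime(
        ['2021-01-05', '2021-01-20', '2021-02-03', '2021-02-17']
    ),
    'High': [10, 30, 25, 5],
})

# With Date as the index, idxmax returns dates rather than row positions.
high = Daily_High.set_index('Date')['High']
month = high.index.to_period('M')

# One row per month: the maximum and the date on which it occurred.
summary = pd.DataFrame({
    'High': high.groupby(month).max(),
    'Max Date': high.groupby(month).idxmax(),
})
print(summary)
```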

How do I concat these two data frames together to get them to overlap and fit with the date?

I want them to fit evenly together instead of stacking on each other. Also, the dates aren't perfectly aligned and I'm quite stuck on how to work around that. If possible, a 0 for capital stock on 12/31 would be great. Thanks.
gos_dataset = pd.DataFrame({'Date':gosd, 'Outstanding Shares':gos}, columns = ['Date', 'Outstanding Shares'])
gcs_dataset = pd.DataFrame({'Date':gcsd, 'Capital Stock':gcs}, columns = ['Date', 'Capital Stock'])
print(pd.concat([gcs_dataset, gos_dataset]))
Date Capital Stock Outstanding Shares
0 2020-01-02 7251.39 NaN
1 2020-01-03 47200.86 NaN
2 2020-01-06 119020.28 NaN
3 2020-01-07 11751250.39 NaN
4 2020-01-08 4790267.25 NaN
5 2020-01-09 -54597.18 NaN
6 2020-01-10 -46410.80 NaN
7 2020-01-13 78669.05 NaN
8 2020-01-14 150819.02 NaN
9 2020-01-15 -23295.45 NaN
10 2020-01-16 87836.67 NaN
11 2020-01-17 6346.19 NaN
12 2020-01-21 10304.31 NaN
13 2020-01-22 -335114.92 NaN
14 2020-01-23 94276.75 NaN
15 2020-01-24 -38526.78 NaN
16 2020-01-27 9998.97 NaN
17 2020-01-28 357659.16 NaN
18 2020-01-29 5487.23 NaN
19 2020-01-30 143213.17 NaN
20 2020-01-31 -25900.72 NaN
0 2019-12-31 NaN 3693737.147
1 2020-01-02 NaN 706.570
2 2020-01-03 NaN 4718.445
3 2020-01-06 NaN 11964.175
4 2020-01-07 NaN 1179829.280
5 2020-01-08 NaN 481078.653
6 2020-01-09 NaN -5471.248
7 2020-01-10 NaN -4629.751
8 2020-01-13 NaN 7812.787
9 2020-01-14 NaN 15096.288
10 2020-01-15 NaN -2314.353
11 2020-01-16 NaN 8753.650
12 2020-01-17 NaN 683.555
13 2020-01-21 NaN 1023.227
14 2020-01-22 NaN -33172.984
15 2020-01-23 NaN 8838.869
16 2020-01-24 NaN -3351.471
17 2020-01-27 NaN 1001.065
18 2020-01-28 NaN 35921.377
19 2020-01-29 NaN 549.450
20 2020-01-30 NaN 14307.865
21 2020-01-31 NaN -2585.328
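One possible way to line the two frames up is an outer merge on Date rather than concat, then filling the missing capital stock value with 0. This is only a sketch using a few rows from the printed output, and it assumes both Date columns hold the same datetime type; merged is a made-up name.

```python
import pandas as pd

# A few rows taken from the printed output.
gcs_dataset = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-02', '2020-01-03']),
    'Capital Stock': [7251.39, 47200.86],
})
gos_dataset = pd.DataFrame({
    'Date': pd.to_datetime(['2019-12-31', '2020-01-02', '2020-01-03']),
    'Outstanding Shares': [3693737.147, 706.570, 4718.445],
})

# An outer merge on Date puts both value columns on one row per date.
merged = (
    pd.merge(gcs_dataset, gos_dataset, on='Date', how='outer')
      .sort_values('Date')
      .reset_index(drop=True)
)

# Dates missing from the capital stock data (here 2019-12-31) become 0.
merged['Capital Stock'] = merged['Capital Stock'].fillna(0)
print(merged)
```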

Merging multiple dataframe using month datetime

I have three dataframes. Each dataframe has a date column. I want to left join the three on the date column. The dates are in the form 'yyyy-mm-dd', but I want to merge the dataframes using 'yyyy-mm' only.
df1
Date X
31-05-2014 1
30-06-2014 2
31-07-2014 3
31-08-2014 4
30-09-2014 5
31-10-2014 6
30-11-2014 7
31-12-2014 8
31-01-2015 1
28-02-2015 3
31-03-2015 4
30-04-2015 5
df2
Date Y
01-09-2014 1
01-10-2014 4
01-11-2014 6
01-12-2014 7
01-01-2015 2
01-02-2015 3
01-03-2015 6
01-04-2015 4
01-05-2015 3
01-06-2015 4
01-07-2015 5
01-08-2015 2
df3
Date Z
01-07-2015 9
01-08-2015 2
01-09-2015 4
01-10-2015 1
01-11-2015 2
01-12-2015 3
01-01-2016 7
01-02-2016 4
01-03-2016 9
01-04-2016 2
01-05-2016 4
01-06-2016 1
I tried:
df4 = pd.merge(df1,df2, how='left', on='Date')
Result:
Date X Y
0 2014-05-31 1 NaN
1 2014-06-30 2 NaN
2 2014-07-31 3 NaN
3 2014-08-31 4 NaN
4 2014-09-30 5 NaN
5 2014-10-31 6 NaN
6 2014-11-30 7 NaN
7 2014-12-31 8 NaN
8 2015-01-31 1 NaN
9 2015-02-28 3 NaN
10 2015-03-31 4 NaN
11 2015-04-30 5 NaN
Use Series.dt.to_period to convert the dates to monthly periods, then merge the list of DataFrames on that period column:
import functools
dfs = [df1, df2, df3]
dfs = [x.assign(per=x['Date'].dt.to_period('m')) for x in dfs]
df = functools.reduce(lambda left,right: pd.merge(left,right,on='per', how='left'), dfs)
print (df)
Date_x X per Date_y Y Date Z
0 2014-05-31 1 2014-05 NaT NaN NaT NaN
1 2014-06-30 2 2014-06 NaT NaN NaT NaN
2 2014-07-31 3 2014-07 NaT NaN NaT NaN
3 2014-08-31 4 2014-08 NaT NaN NaT NaN
4 2014-09-30 5 2014-09 2014-09-01 1.0 NaT NaN
5 2014-10-31 6 2014-10 2014-10-01 4.0 NaT NaN
6 2014-11-30 7 2014-11 2014-11-01 6.0 NaT NaN
7 2014-12-31 8 2014-12 2014-12-01 7.0 NaT NaN
8 2015-01-31 1 2015-01 2015-01-01 2.0 NaT NaN
9 2015-02-28 3 2015-02 2015-02-01 3.0 NaT NaN
10 2015-03-31 4 2015-03 2015-03-01 6.0 NaT NaN
11 2015-04-30 5 2015-04 2015-04-01 4.0 NaT NaN
Alternative:
df1['per'] = df1['Date'].dt.to_period('m')
df2['per'] = df2['Date'].dt.to_period('m')
df3['per'] = df3['Date'].dt.to_period('m')
df4 = pd.merge(df1,df2, how='left', on='per').merge(df3, how='left', on='per')
print (df4)
Date_x X per Date_y Y Date Z
0 2014-05-31 1 2014-05 NaT NaN NaT NaN
1 2014-06-30 2 2014-06 NaT NaN NaT NaN
2 2014-07-31 3 2014-07 NaT NaN NaT NaN
3 2014-08-31 4 2014-08 NaT NaN NaT NaN
4 2014-09-30 5 2014-09 2014-09-01 1.0 NaT NaN
5 2014-10-31 6 2014-10 2014-10-01 4.0 NaT NaN
6 2014-11-30 7 2014-11 2014-11-01 6.0 NaT NaN
7 2014-12-31 8 2014-12 2014-12-01 7.0 NaT NaN
8 2015-01-31 1 2015-01 2015-01-01 2.0 NaT NaN
9 2015-02-28 3 2015-02 2015-02-01 3.0 NaT NaN
10 2015-03-31 4 2015-03 2015-03-01 6.0 NaT NaN
11 2015-04-30 5 2015-04 2015-04-01 4.0 NaT NaN
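A possible variation that avoids the Date_x / Date_y columns is to keep Date only from the left-most frame. The sketch below uses a few rows from the question and assumes the dates arrive as day-month-year strings; the helper with_month and the name frames are made up for illustration.

```python
import functools
import pandas as pd

# A few rows from the question, with dates as day-month-year strings.
df1 = pd.DataFrame({'Date': ['31-08-2014', '30-09-2014', '31-10-2014'],
                    'X': [4, 5, 6]})
df2 = pd.DataFrame({'Date': ['01-09-2014', '01-10-2014'], 'Y': [1, 4]})
df3 = pd.DataFrame({'Date': ['01-07-2015'], 'Z': [9]})

def with_month(frame, keep_date):
    """Add a monthly period column 'per'; optionally drop the original Date."""
    out = frame.copy()
    out['Date'] = pd.to_datetime(out['Date'], format='%d-%m-%Y')
    out['per'] = out['Date'].dt.to_period('M')
    return out if keep_date else out.drop(columns='Date')

# Keep Date only from the first frame so the merge adds no suffixed copies.
frames = [with_month(df1, True), with_month(df2, False), with_month(df3, False)]
merged = functools.reduce(
    lambda left, right: pd.merge(left, right, on='per', how='left'), frames
)
print(merged)
```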
