How to collapse rows in dataframe by similar columns

I would like to collapse the following:
Date Category input1 input2
2019-11-08 1 NaN 182.420781
2019-12-09 1 NaN 174.251870
2020-01-08 1 NaN 186.296325
2019-11-08 1 177.670203 NaN
2019-12-09 1 177.001475 NaN
2020-01-08 1 179.940017 NaN
2019-11-08 2 NaN 84.369389
2019-12-09 2 NaN 87.882385
2020-01-08 2 NaN 86.309750
2019-11-08 2 83.995045 NaN
2019-12-09 2 86.166011 NaN
2020-01-08 2 89.449188 NaN
2019-11-08 3 NaN 83.878360
2019-12-09 3 NaN 90.910188
2020-01-08 3 NaN 93.120330
2019-11-08 3 84.010900 NaN
2019-12-09 3 86.916081 NaN
2020-01-08 3 91.620387 NaN
into:
Date Category input1 input2
2019-11-08 1 177.670203 182.420781
2019-12-09 1 177.001475 174.251870
2020-01-08 1 179.940017 186.296325
2019-11-08 2 83.995045 84.369389
2019-12-09 2 86.166011 87.882385
2020-01-08 2 89.449188 86.309750
2019-11-08 3 84.010900 83.878360
2019-12-09 3 86.916081 90.910188
2020-01-08 3 91.620387 93.120330
I've tried looking into agg, join, etc., but I simply don't have enough knowledge to do what I need. Essentially, the inputs are repeated for each Date and Category, so I would just like to collapse them into the same respective rows.

Let us try groupby with first: for each group it returns the first non-null value in every column.
s = df.groupby(['Category','Date'],as_index=False).first()
s
Category Date input1 input2
0 1 2019-11-08 177.670203 182.420781
1 1 2019-12-09 177.001475 174.251870
2 1 2020-01-08 179.940017 186.296325
3 2 2019-11-08 83.995045 84.369389
4 2 2019-12-09 86.166011 87.882385
5 2 2020-01-08 89.449188 86.309750
6 3 2019-11-08 84.010900 83.878360
7 3 2019-12-09 86.916081 90.910188
8 3 2020-01-08 91.620387 93.120330
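As a self-contained illustration of the answer above, here is a minimal sketch on a cut-down version of the question's data (only Category 1 and two dates; the frame is rebuilt by hand, so treat the construction as an assumption about how the original was loaded). It shows that first() skips NaN within each (Category, Date) group, so the two half-filled rows merge into one.

```python
import numpy as np
import pandas as pd

# Cut-down sample: each (Date, Category) pair appears twice,
# once with only input2 filled and once with only input1 filled.
df = pd.DataFrame({
    "Date": ["2019-11-08", "2019-12-09", "2019-11-08", "2019-12-09"],
    "Category": [1, 1, 1, 1],
    "input1": [np.nan, np.nan, 177.670203, 177.001475],
    "input2": [182.420781, 174.251870, np.nan, np.nan],
})

# first() takes the first non-null value per column within each group
collapsed = df.groupby(["Category", "Date"], as_index=False).first()
print(collapsed)
```

If a group could hold more than one non-null value in the same column, first() silently keeps only the earliest one, so it is worth checking that assumption on real data.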

Related

Why is pandas str.replace returning NaN?

I am trying to remove the comma separator from values in a pandas dataframe so that I can convert them to integers. I have been using the following method:
df_orders['qty'] = df_orders['qty'].str.replace(',','')
However this seems to be returning NaN values for some numbers which did not originally contain ',' in their values. I have included a sample of my Input data and current output below:
Input:
date sku qty
556603 2020-10-25 A 6
590904 2020-10-21 A 5
595307 2020-10-20 A 31
602678 2020-10-19 A 11
615022 2020-10-18 A 2
641077 2020-10-16 A 1
650203 2020-10-15 A 3
655363 2020-10-14 A 18
667919 2020-10-13 A 5
674990 2020-10-12 A 2
703901 2020-10-09 A 1
715411 2020-10-08 A 1
721557 2020-10-07 A 31
740515 2020-10-06 A 49
752670 2020-10-05 A 4
808426 2020-09-28 A 2
848057 2020-09-23 A 1
865751 2020-09-21 A 2
886630 2020-09-18 A 3
901095 2020-09-16 A 47
938648 2020-09-10 A 2
969909 2020-09-07 A 3
1021548 2020-08-31 A 2
1032254 2020-08-30 A 8
1077443 2020-08-25 A 5
1089670 2020-08-24 A 24
1098843 2020-08-23 A 16
1102025 2020-08-22 A 23
1179347 2020-08-12 A 1
1305700 2020-07-29 A 1
1316343 2020-07-28 A 1
1399930 2020-07-19 A 1
1451864 2020-07-15 A 1
1463195 2020-07-14 A 15
2129080 2020-05-19 A 1
2143468 2020-05-18 A 1
Current Output:
date sku qty
556603 2020-10-25 A 6
590904 2020-10-21 A 5
595307 2020-10-20 A 31
602678 2020-10-19 A 11
615022 2020-10-18 A 2
641077 2020-10-16 A 1
650203 2020-10-15 A 3
655363 2020-10-14 A NaN
667919 2020-10-13 A NaN
674990 2020-10-12 A NaN
703901 2020-10-09 A NaN
715411 2020-10-08 A NaN
721557 2020-10-07 A NaN
740515 2020-10-06 A NaN
752670 2020-10-05 A NaN
808426 2020-09-28 A 2
848057 2020-09-23 A 1
865751 2020-09-21 A 2
886630 2020-09-18 A 3
901095 2020-09-16 A 47
938648 2020-09-10 A NaN
969909 2020-09-07 A NaN
1021548 2020-08-31 A NaN
1032254 2020-08-30 A NaN
1077443 2020-08-25 A NaN
1089670 2020-08-24 A NaN
1098843 2020-08-23 A NaN
1102025 2020-08-22 A NaN
1179347 2020-08-12 A NaN
1305700 2020-07-29 A NaN
1316343 2020-07-28 A 1
1399930 2020-07-19 A 1
1451864 2020-07-15 A 1
1463195 2020-07-14 A 15
2129080 2020-05-19 A 1
2143468 2020-05-18 A 1
I have had a look around but can't seem to find what is causing this error.

I was able to reproduce your issue:
# toy df
df
qty
0 1
1 2,
2 3
df['qty'].str.replace(',', '')
0 NaN
1 2
2 NaN
Name: qty, dtype: object
I created df by doing this:
df = pd.DataFrame({'qty': [1, '2,', 3]})
In other words, your column has mixed data types: some values are integers while others are strings. When you apply .str methods to such a column, the non-string values are returned as NaN to indicate "hey, it doesn't make sense to run a str method on an int".
You may fix this by converting the entire column to string, then back to int:
df['qty'].astype(str).str.replace(',', '').astype(int)
Or, if you want something a little more robust (values with no digits become NaN instead of raising an error), try
df['qty'] = pd.to_numeric(
df['qty'].astype(str).str.replace(',', '').str.extract(r'(\d+)', expand=False), errors='coerce')
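The behaviour described above can be checked end to end with a small sketch. The sample values (including "1,234") are made up for illustration and are not from the question's data.

```python
import pandas as pd

# Mixed column: some real ints, some strings with thousands separators
df = pd.DataFrame({"qty": [1, "2,", 3, "1,234"]})

# .str methods only operate on string elements; the ints come back as NaN
naive = df["qty"].str.replace(",", "", regex=False)

# Casting to str first makes every element eligible, then convert to numbers
cleaned = pd.to_numeric(
    df["qty"].astype(str).str.replace(",", "", regex=False),
    errors="coerce",  # anything still non-numeric becomes NaN
)
print(naive)
print(cleaned)
```

Passing regex=False makes it explicit that the comma is a literal character rather than a pattern.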

How do I concat these two data frames together to get them to overlap and fit with the date?

I want the two data frames to line up by date instead of stacking on top of each other. The dates also don't align perfectly (Outstanding Shares has a 2019-12-31 row that Capital Stock lacks), and I'm quite stuck on how to work around that. If possible, a 0 for Capital Stock on 12/31 would be great. Thanks.
gos_dataset = pd.DataFrame({'Date':gosd, 'Outstanding Shares':gos}, columns = ['Date', 'Outstanding Shares'])
gcs_dataset = pd.DataFrame({'Date':gcsd, 'Capital Stock':gcs}, columns = ['Date', 'Capital Stock'])
print(pd.concat([gcs_dataset, gos_dataset]))
Date Capital Stock Outstanding Shares
0 2020-01-02 7251.39 NaN
1 2020-01-03 47200.86 NaN
2 2020-01-06 119020.28 NaN
3 2020-01-07 11751250.39 NaN
4 2020-01-08 4790267.25 NaN
5 2020-01-09 -54597.18 NaN
6 2020-01-10 -46410.80 NaN
7 2020-01-13 78669.05 NaN
8 2020-01-14 150819.02 NaN
9 2020-01-15 -23295.45 NaN
10 2020-01-16 87836.67 NaN
11 2020-01-17 6346.19 NaN
12 2020-01-21 10304.31 NaN
13 2020-01-22 -335114.92 NaN
14 2020-01-23 94276.75 NaN
15 2020-01-24 -38526.78 NaN
16 2020-01-27 9998.97 NaN
17 2020-01-28 357659.16 NaN
18 2020-01-29 5487.23 NaN
19 2020-01-30 143213.17 NaN
20 2020-01-31 -25900.72 NaN
0 2019-12-31 NaN 3693737.147
1 2020-01-02 NaN 706.570
2 2020-01-03 NaN 4718.445
3 2020-01-06 NaN 11964.175
4 2020-01-07 NaN 1179829.280
5 2020-01-08 NaN 481078.653
6 2020-01-09 NaN -5471.248
7 2020-01-10 NaN -4629.751
8 2020-01-13 NaN 7812.787
9 2020-01-14 NaN 15096.288
10 2020-01-15 NaN -2314.353
11 2020-01-16 NaN 8753.650
12 2020-01-17 NaN 683.555
13 2020-01-21 NaN 1023.227
14 2020-01-22 NaN -33172.984
15 2020-01-23 NaN 8838.869
16 2020-01-24 NaN -3351.471
17 2020-01-27 NaN 1001.065
18 2020-01-28 NaN 35921.377
19 2020-01-29 NaN 549.450
20 2020-01-30 NaN 14307.865
21 2020-01-31 NaN -2585.328
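No answer is recorded for this question here. One possible approach, sketched under the assumption that both Date columns hold comparable datetime values, is an outer merge on Date rather than concat, followed by filling the missing Capital Stock with 0. The frames below are a hand-built subset of the printed data.

```python
import pandas as pd

# Small stand-ins for gcs_dataset and gos_dataset from the question
gcs_dataset = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-02", "2020-01-03"]),
    "Capital Stock": [7251.39, 47200.86],
})
gos_dataset = pd.DataFrame({
    "Date": pd.to_datetime(["2019-12-31", "2020-01-02", "2020-01-03"]),
    "Outstanding Shares": [3693737.147, 706.570, 4718.445],
})

merged = (
    pd.merge(gcs_dataset, gos_dataset, on="Date", how="outer")  # keep dates from both sides
    .sort_values("Date")
    .fillna({"Capital Stock": 0})  # 0 where Capital Stock has no row for that date
    .reset_index(drop=True)
)
print(merged)
```

pd.concat stacks rows by default, which is why the original output shows two blocks with NaN; a merge keyed on Date puts values for the same date on one row.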

How can I split this excel file into two data frames?

When I try to load this Excel spreadsheet into a dataframe, I get a lot of NaN values because of all the empty columns in the file. I'd really like to split Class I and Class A from this Excel file into two separate pandas dataframes.
In:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
excel_file = 'EXAMPLE.xlsx'
df = pd.read_excel(excel_file, header=8)
print(df)
sys.exit()
Out:
Class I Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Class A Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
0 Date NaN column 1 NaN column 2 NaN NaN NaN Date NaN column 1 NaN column 2
1 2019-12-31 00:00:00 NaN 1 NaN A NaN NaN NaN 2019-12-31 00:00:00 NaN A NaN 1
2 2020-01-01 00:00:00 NaN 2 NaN B NaN NaN NaN 2020-01-01 00:00:00 NaN B NaN 2
3 2020-01-02 00:00:00 NaN 3 NaN C NaN NaN NaN 2020-01-02 00:00:00 NaN C NaN 3
4 2020-01-03 00:00:00 NaN 4 NaN D NaN NaN NaN 2020-01-03 00:00:00 NaN D NaN 4
5 2020-01-04 00:00:00 NaN 5 NaN E NaN NaN NaN 2020-01-04 00:00:00 NaN E NaN 5
6 2020-01-05 00:00:00 NaN 6 NaN F NaN NaN NaN 2020-01-05 00:00:00 NaN F NaN 6
7 2020-01-06 00:00:00 NaN 7 NaN G NaN NaN NaN 2020-01-06 00:00:00 NaN G NaN 7
8 2020-01-07 00:00:00 NaN 8 NaN H NaN NaN NaN 2020-01-07 00:00:00 NaN H NaN 8

Try to use the parameter usecols. From the documentation:
If list of int, then indicates list of column numbers to be parsed.
import pandas as pd
df1 = pd.read_excel(excel_file,usecols=[0,2,4])
df2 = pd.read_excel(excel_file,usecols=[8,10,12])
This should create two dataframes with the columns you want.

Merging multiple dataframe using month datetime

I have three dataframes. Each dataframe has a date column. I want to left join the three using the date column. Dates are present in the form 'yyyy-mm-dd', but I want to merge the dataframes using 'yyyy-mm' only.
df1
Date X
31-05-2014 1
30-06-2014 2
31-07-2014 3
31-08-2014 4
30-09-2014 5
31-10-2014 6
30-11-2014 7
31-12-2014 8
31-01-2015 1
28-02-2015 3
31-03-2015 4
30-04-2015 5
df2
Date Y
01-09-2014 1
01-10-2014 4
01-11-2014 6
01-12-2014 7
01-01-2015 2
01-02-2015 3
01-03-2015 6
01-04-2015 4
01-05-2015 3
01-06-2015 4
01-07-2015 5
01-08-2015 2
df3
Date Z
01-07-2015 9
01-08-2015 2
01-09-2015 4
01-10-2015 1
01-11-2015 2
01-12-2015 3
01-01-2016 7
01-02-2016 4
01-03-2016 9
01-04-2016 2
01-05-2016 4
01-06-2016 1
Try:
df4 = pd.merge(df1,df2, how='left', on='Date')
Result:
Date X Y
0 2014-05-31 1 NaN
1 2014-06-30 2 NaN
2 2014-07-31 3 NaN
3 2014-08-31 4 NaN
4 2014-09-30 5 NaN
5 2014-10-31 6 NaN
6 2014-11-30 7 NaN
7 2014-12-31 8 NaN
8 2015-01-31 1 NaN
9 2015-02-28 3 NaN
10 2015-03-31 4 NaN
11 2015-04-30 5 NaN

Use Series.dt.to_period to convert each Date to a monthly period, then merge all DataFrames in the list on that period column:
import functools
dfs = [df1, df2, df3]
dfs = [x.assign(per=x['Date'].dt.to_period('M')) for x in dfs]
df = functools.reduce(lambda left,right: pd.merge(left,right,on='per', how='left'), dfs)
print (df)
Date_x X per Date_y Y Date Z
0 2014-05-31 1 2014-05 NaT NaN NaT NaN
1 2014-06-30 2 2014-06 NaT NaN NaT NaN
2 2014-07-31 3 2014-07 NaT NaN NaT NaN
3 2014-08-31 4 2014-08 NaT NaN NaT NaN
4 2014-09-30 5 2014-09 2014-09-01 1.0 NaT NaN
5 2014-10-31 6 2014-10 2014-10-01 4.0 NaT NaN
6 2014-11-30 7 2014-11 2014-11-01 6.0 NaT NaN
7 2014-12-31 8 2014-12 2014-12-01 7.0 NaT NaN
8 2015-01-31 1 2015-01 2015-01-01 2.0 NaT NaN
9 2015-02-28 3 2015-02 2015-02-01 3.0 NaT NaN
10 2015-03-31 4 2015-03 2015-03-01 6.0 NaT NaN
11 2015-04-30 5 2015-04 2015-04-01 4.0 NaT NaN
Alternative:
df1['per'] = df1['Date'].dt.to_period('M')
df2['per'] = df2['Date'].dt.to_period('M')
df3['per'] = df3['Date'].dt.to_period('M')
df4 = pd.merge(df1,df2, how='left', on='per').merge(df3, how='left', on='per')
print (df4)
Date_x X per Date_y Y Date Z
0 2014-05-31 1 2014-05 NaT NaN NaT NaN
1 2014-06-30 2 2014-06 NaT NaN NaT NaN
2 2014-07-31 3 2014-07 NaT NaN NaT NaN
3 2014-08-31 4 2014-08 NaT NaN NaT NaN
4 2014-09-30 5 2014-09 2014-09-01 1.0 NaT NaN
5 2014-10-31 6 2014-10 2014-10-01 4.0 NaT NaN
6 2014-11-30 7 2014-11 2014-11-01 6.0 NaT NaN
7 2014-12-31 8 2014-12 2014-12-01 7.0 NaT NaN
8 2015-01-31 1 2015-01 2015-01-01 2.0 NaT NaN
9 2015-02-28 3 2015-02 2015-02-01 3.0 NaT NaN
10 2015-03-31 4 2015-03 2015-03-01 6.0 NaT NaN
11 2015-04-30 5 2015-04 2015-04-01 4.0 NaT NaN
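The snippets above rely on the Date columns already being datetime64, because the .dt accessor is not available on plain strings. A minimal sketch of the full path, assuming the dates arrive as 'dd-mm-yyyy' strings as printed in the question (the two small frames are a hand-picked subset), could look like this. Dropping the right-hand Date before merging is my own choice to avoid the Date_x/Date_y suffixes, not part of the answer.

```python
import pandas as pd

# Parse the day-first strings explicitly so .dt becomes available
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["31-08-2014", "30-09-2014"], format="%d-%m-%Y"),
    "X": [4, 5],
})
df2 = pd.DataFrame({
    "Date": pd.to_datetime(["01-09-2014", "01-10-2014"], format="%d-%m-%Y"),
    "Y": [1, 4],
})

# Month-level key: 2014-09-30 and 2014-09-01 both map to period 2014-09
left = df1.assign(per=df1["Date"].dt.to_period("M"))
right = df2.assign(per=df2["Date"].dt.to_period("M")).drop(columns="Date")

merged = left.merge(right, on="per", how="left")
print(merged)
```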

Add missing times in dataframe column with pandas

I have a dataframe like so:
df = pd.DataFrame({'time':['23:59:45','23:49:50','23:59:55','00:00:00','00:00:05','00:00:10','00:00:15'],
'X':[-5,-4,-2,5,6,10,11],
'Y':[3,4,5,9,20,22,23]})
As you can see, the time column holds time-of-day values stored as strings, and they run across midnight. The time is given every 5 seconds!
My goal, however, is to add empty rows (filled with NaN, for example) so that there is one row per second. Finally, the time column should be converted to a timestamp and set as the index.
Could you please suggest a smart and elegant way to achieve my goal?
Here is what the output should look like:
X Y
time
23:59:45 -5.0 3.0
23:59:46 NaN NaN
23:59:47 NaN NaN
23:59:48 NaN NaN
... ... ...
00:00:10 10.0 22.0
00:00:11 NaN NaN
00:00:12 NaN NaN
00:00:13 NaN NaN
00:00:14 NaN NaN
00:00:15 11.0 23.0
Note: I do not need the dates.

Use to_timedelta with reindex by timedelta_range:
df['time'] = pd.to_timedelta(df['time'])
idx = pd.timedelta_range('0', '23:59:59', freq='S', name='time')
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 00:00:00 5.0 9.0
1 00:00:01 NaN NaN
2 00:00:02 NaN NaN
3 00:00:03 NaN NaN
4 00:00:04 NaN NaN
5 00:00:05 6.0 20.0
6 00:00:06 NaN NaN
7 00:00:07 NaN NaN
8 00:00:08 NaN NaN
9 00:00:09 NaN NaN
If you need a value other than NaN in the added rows, pass fill_value to reindex instead:
df = df.set_index('time').reindex(idx, fill_value=0).reset_index()
print (df.head(10))
time X Y
0 00:00:00 5 9
1 00:00:01 0 0
2 00:00:02 0 0
3 00:00:03 0 0
4 00:00:04 0 0
5 00:00:05 6 20
6 00:00:06 0 0
7 00:00:07 0 0
8 00:00:08 0 0
9 00:00:09 0 0
Another solution uses resample, but it only covers the span between the smallest and largest time in the data, so rows may be missing at the end:
df = df.set_index('time').resample('S').first()
print (df.tail(10))
X Y
time
23:59:46 NaN NaN
23:59:47 NaN NaN
23:59:48 NaN NaN
23:59:49 NaN NaN
23:59:50 NaN NaN
23:59:51 NaN NaN
23:59:52 NaN NaN
23:59:53 NaN NaN
23:59:54 NaN NaN
23:59:55 -2.0 5.0
EDIT1:
import numpy as np
idx1 = pd.timedelta_range('23:59:45', '23:59:59', freq='S', name='time')
idx2 = pd.timedelta_range('0', '00:00:15', freq='S', name='time')
idx = np.concatenate([idx1, idx2])
df['time'] = pd.to_timedelta(df['time'])
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 23:59:45 -5.0 3.0
1 23:59:46 NaN NaN
2 23:59:47 NaN NaN
3 23:59:48 NaN NaN
4 23:59:49 NaN NaN
5 23:59:50 NaN NaN
6 23:59:51 NaN NaN
7 23:59:52 NaN NaN
8 23:59:53 NaN NaN
9 23:59:54 NaN NaN
print (df.tail(10))
time X Y
21 00:00:06 NaN NaN
22 00:00:07 NaN NaN
23 00:00:08 NaN NaN
24 00:00:09 NaN NaN
25 00:00:10 10.0 22.0
26 00:00:11 NaN NaN
27 00:00:12 NaN NaN
28 00:00:13 NaN NaN
29 00:00:14 NaN NaN
30 00:00:15 11.0 23.0
EDIT:
Another solution: shift the times that fall after midnight by adding a 1-day timedelta, so the values keep increasing:
df['time'] = pd.to_timedelta(df['time'])
a = pd.to_timedelta(df['time'].diff().dt.days.abs().cumsum().fillna(1).sub(1), unit='d')
df['time'] = df['time'] + a
print (df)
X Y time
0 -5 3 0 days 23:59:45
1 -4 4 0 days 23:49:50
2 -2 5 0 days 23:59:55
3 5 9 1 days 00:00:00
4 6 20 1 days 00:00:05
5 10 22 1 days 00:00:10
6 11 23 1 days 00:00:15
idx = pd.timedelta_range(df['time'].min(), df['time'].max(), freq='S', name='time')
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 23:49:50 -4.0 4.0
1 23:49:51 NaN NaN
2 23:49:52 NaN NaN
3 23:49:53 NaN NaN
4 23:49:54 NaN NaN
5 23:49:55 NaN NaN
6 23:49:56 NaN NaN
7 23:49:57 NaN NaN
8 23:49:58 NaN NaN
9 23:49:59 NaN NaN
print (df.tail(10))
time X Y
616 1 days 00:00:06 NaN NaN
617 1 days 00:00:07 NaN NaN
618 1 days 00:00:08 NaN NaN
619 1 days 00:00:09 NaN NaN
620 1 days 00:00:10 10.0 22.0
621 1 days 00:00:11 NaN NaN
622 1 days 00:00:12 NaN NaN
623 1 days 00:00:13 NaN NaN
624 1 days 00:00:14 NaN NaN
625 1 days 00:00:15 11.0 23.0
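The day-offset idea can also be sketched in a simpler form, under the assumption that the rows are in chronological order so that any backwards jump in the clock marks a midnight rollover. That assumption does not hold for the 23:49:50 row in the question's sample, so the data below is a cleaned, hand-made subset, and the rollover rule is my own simplification rather than the answer's exact expression.

```python
import pandas as pd

# Chronologically ordered sample that crosses midnight once
df = pd.DataFrame({
    "time": ["23:59:55", "00:00:00", "00:00:05"],
    "X": [-2, 5, 6],
})

td = pd.to_timedelta(df["time"])

# Count backwards jumps so far; each one is treated as a new day
rollovers = (td.diff() < pd.Timedelta(0)).cumsum()
td = td + pd.to_timedelta(rollovers, unit="D")

# One row per second between the first and last observation
full = pd.timedelta_range(td.min(), td.max(), freq="s", name="time")
out = df.assign(time=td).set_index("time").reindex(full)
print(out)
```

Values after midnight display with a "1 days" prefix; if only the clock time is wanted for display, the day component can be stripped afterwards.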
