Convert an array-like column to fill each row (reshape) - python

This is a sample of the data:
import pandas as pd
import numpy as np
from datetime import date

data = pd.DataFrame([(date(2022, 1, 1), np.random.randint(10, size=30)),
                     (date(2022, 2, 1), np.random.randint(10, size=30)),
                     (date(2022, 3, 1), np.random.randint(10, size=30))],
                    columns=('month_begin', 'daily_sales'))
I want to (1) create a column filled with each day (so the column would contain 2022-01-01, 2022-01-02, ..., 2022-03-31) and (2) break the array-like column out into one value per row.
I was thinking about creating a list of days between 2022-01-01 and 2022-03-31, but I was stuck on how to fill each row with the daily data. Any suggestion is appreciated!

import pandas as pd
import numpy as np
from datetime import date

data = pd.DataFrame([(date(2022, 1, 1), np.random.randint(10, size=30)),
                     (date(2022, 2, 1), np.random.randint(10, size=30)),
                     (date(2022, 3, 1), np.random.randint(10, size=30))],
                    columns=('month_begin', 'daily_sales'))

result = pd.DataFrame()
for index, row in data.iterrows():
    df = pd.DataFrame({'date': pd.date_range(row['month_begin'],
                                             periods=len(row['daily_sales'])),
                       'daily_sales': row['daily_sales']})
    result = pd.concat([result, df], ignore_index=True)
print(result)
Output:
date daily_sales
0 2022-01-01 9
1 2022-01-02 7
2 2022-01-03 7
3 2022-01-04 8
4 2022-01-05 4
.. ... ...
85 2022-03-26 3
86 2022-03-27 1
87 2022-03-28 9
88 2022-03-29 7
89 2022-03-30 0
[90 rows x 2 columns]
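If you are on pandas 1.3 or later, a shorter alternative (a sketch, not part of the original answer) is to build the per-row date ranges and then let DataFrame.explode expand both columns at once. Here the array sizes are set to the actual month lengths so the dates run through 2022-03-31:

```python
import numpy as np
import pandas as pd
from datetime import date

rng = np.random.default_rng(0)
data = pd.DataFrame([(date(2022, 1, 1), rng.integers(10, size=31)),
                     (date(2022, 2, 1), rng.integers(10, size=28)),
                     (date(2022, 3, 1), rng.integers(10, size=31))],
                    columns=('month_begin', 'daily_sales'))

# One date per array element for each row
data['date'] = data.apply(
    lambda r: list(pd.date_range(r['month_begin'],
                                 periods=len(r['daily_sales']))),
    axis=1)

# Expand both list-like columns together (needs pandas >= 1.3)
result = (data.explode(['date', 'daily_sales'])
              [['date', 'daily_sales']]
              .reset_index(drop=True))
print(result)  # 90 rows, 2022-01-01 through 2022-03-31
```

This avoids the per-row DataFrame construction and the repeated concat, which get slow for many months.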

Related

Adding repeating date column to pandas DataFrame

I am new to pandas and I am struggling to add dates to my pandas DataFrame df, which comes from a .csv file. I have a DataFrame with several unique ids, and each id has 120 months, so I need to add a date column. Each id should have exactly the same dates for its 120 periods. I am struggling because after the first id there is another id, and the dates should start over again. My data in the csv file looks like this:
month id
1 1593
2 1593
...
120 1593
1 8964
2 8964
...
120 8964
1 58944
...
Here is my code; I am not really sure how I should use the groupby method to add dates to my dataframe based on id:
group = df.groupby('id')
group['date'] = pd.date_range(start='2020/6/1', periods=120, freq='MS').shift(14, freq='D')
Please help me!!!
If you know how many sets of 120 you have, you can use this. Just change the 2 at the end. This example creates a repeating 120 dates twice. You may have to adapt for your specific use.
new_dates = list(pd.date_range(start='2020/6/1', periods=120, freq='MS').shift(14,freq='D'))*2
df = pd.DataFrame({'date': new_dates})
These two are the same, except one uses a lambda:
def repeatingDates(numIds):
    return [d.strftime('%Y/%m/%d')
            for d in pandas.date_range(start='2020/6/1', periods=120, freq='MS')] * numIds

repeatingDates = lambda numIds: [d.strftime('%Y/%m/%d')
                                 for d in pandas.date_range(start='2020/6/1', periods=120, freq='MS')] * numIds
You can use Pandas transform. This is how I solved it:
dataf['dates'] = (
    dataf
    .groupby("id")
    .transform(lambda d: pd.date_range(start='2020/6/1',
                                       periods=d.max(),
                                       freq='MS').shift(14, freq='D'))
)
Results:
month id dates
0 1 1593 2020-06-15
1 2 1593 2020-07-15
2 3 1593 2020-08-15
3 1 8964 2020-06-15
4 2 8964 2020-07-15
5 1 58944 2020-06-15
6 2 58944 2020-07-15
7 3 58944 2020-08-15
8 4 58944 2020-09-15
Test data:
import io
import pandas as pd
dataf = pd.read_csv(io.StringIO("""
month,id
1,1593
2,1593
3,1593
1,8964
2,8964
1,58944
2,58944
3,58944
4,58944""")).astype(int)
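If the number of ids is not known in advance, another way (a sketch, not from the answers above) is to index the pre-built dates by each group's positional counter from cumcount. Shown with 3 periods per id for brevity; the real data would use periods=120:

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 2, 3, 1, 2, 3],
                   'id':    [1593, 1593, 1593, 8964, 8964, 8964]})

# One date per period, same construction as in the question
dates = pd.date_range(start='2020/6/1', periods=3, freq='MS').shift(14, freq='D')

# cumcount yields 0, 1, 2, ... within each id, which indexes into `dates`
df['date'] = dates[df.groupby('id').cumcount().to_numpy()]
print(df)
```

This also tolerates ids with fewer than the full number of periods, since each group simply consumes as many dates as it has rows.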

Problem with group by max period in dataframe pandas

I'm still a novice with Python, and I'm having problems trying to group some data to show the record that has the highest (maximum) date. The dataframe is as follows:
...
I am trying the following:
df_2 = df.max(axis = 0)
df_2 = df.periodo.max()
df_2 = df.loc[df.groupby('periodo').periodo.idxmax()]
And it gives me back:
Timestamp('2020-06-01 00:00:00')
periodo 2020-06-01 00:00:00
valor 3.49136
Although the value for 'periodo' is correct, the one for 'valor' is not, since I need to obtain the complete corresponding record ('periodo' and 'valor'), not the maximum of each column. I have tried other ways but I can't get what I want. What do I need to do?
Thank you in advance, I will be attentive to your answers!
Regards!
# import packages we need, seed random number generator
import pandas as pd
import datetime
import random
random.seed(1)
Create example dataframe
start_date = datetime.date(2020, 1, 1)
day_count = 10
dates = [start_date + datetime.timedelta(n) for n in range(day_count)]
values = [random.randint(1, 1000) for _ in dates]
df = pd.DataFrame(zip(dates, values), columns=['dates', 'values'])
ie df will be:
dates values
0 2020-01-01 389
1 2020-01-02 808
2 2020-01-03 215
3 2020-01-04 97
4 2020-01-05 500
5 2020-01-06 30
6 2020-01-07 915
7 2020-01-08 856
8 2020-01-09 400
9 2020-01-10 444
Select rows with highest entry in each column
You can do:
df[df['dates'] == df['dates'].max()]
(Or, if you want to use idxmax, you can do: df.loc[[df['dates'].idxmax()]])
Returning:
dates values
9 2020-01-10 444
i.e. this is the row with the latest date, and:
df[df['values'] == df['values'].max()]
(Or, if you want to use idxmax again, you can do: df.loc[[df['values'].idxmax()]] - as in Scott Boston's answer.)
returning:
dates values
6 2020-01-07 915
i.e. this is the row with the highest value in the values column.
Reference.
I think you need something like:
df.loc[[df['valor'].idxmax()]]
Where you use idxmax on the 'valor' column. Then use that index to select that row.
MVCE:
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'periodo':pd.date_range('2018-07-01', periods = 600, freq='d'),
'valor':np.random.random(600)+3})
df.loc[[df['valor'].idxmax()]]
Output:
periodo valor
474 2019-10-18 3.998918
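For completeness, nlargest and sort_values give the same full record without going through idxmax; a minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'periodo': pd.to_datetime(['2020-04-01', '2020-05-01', '2020-06-01']),
                   'valor': [3.1, 3.9, 3.5]})

# Full row(s) holding the maximum 'valor'
top = df.nlargest(1, 'valor')

# Same row via sorting; head(n) would keep the n largest rows
top_sorted = df.sort_values('valor', ascending=False).head(1)
print(top)
```

df.nlargest(1, 'periodo') works the same way when you want the row with the latest date instead.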

Grouping by column groups on a data frame in python pandas

I have a data frame with columns for every month of every year from 2000 to 2016
df.columns
output
Index(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06',
'2000-07', '2000-08', '2000-09', '2000-10',
...
'2015-11', '2015-12', '2016-01', '2016-02', '2016-03', '2016-04',
'2016-05', '2016-06', '2016-07', '2016-08'],
dtype='object', length=200)
and I would like to group these columns by quarters.
I made a dictionary, believing the best method would be to use groupby and then aggregate with mean:
m2q = {'2000q1': ['2000-01', '2000-02', '2000-03'],
'2000q2': ['2000-04', '2000-05', '2000-06'],
'2000q3': ['2000-07', '2000-08', '2000-09'],
...
'2016q2': ['2016-04', '2016-05', '2016-06'],
'2016q3': ['2016-07', '2016-08']}
but
df.groupby(m2q)
is not giving me the desired output.
In fact, it's giving me an empty grouping.
Any suggestions to make this grouping work?
Or perhaps a more pythonic solution to categorize by quarters, taking the mean of the specified columns?
You can convert your index to a DatetimeIndex (example 1) or a PeriodIndex (example 2).
Also see the Time Series / Date functionality section of the pandas documentation for more detail.
import numpy as np
import pandas as pd
idx = ['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06',
'2000-07', '2000-08', '2000-09', '2000-10', '2000-11', '2000-12']
df = pd.DataFrame(np.arange(12), index=idx, columns=['SAMPLE_DATA'])
print(df)
SAMPLE_DATA
2000-01 0
2000-02 1
2000-03 2
2000-04 3
2000-05 4
2000-06 5
2000-07 6
2000-08 7
2000-09 8
2000-10 9
2000-11 10
2000-12 11
# Handle your timeseries data with pandas timeseries / date functionality
df.index=pd.to_datetime(df.index)
example 1
print(df.resample('Q').sum())
SAMPLE_DATA
2000-03-31 3
2000-06-30 12
2000-09-30 21
2000-12-31 30
example 2
print(df.to_period('Q').groupby(level=0).sum())
SAMPLE_DATA
2000Q1 3
2000Q2 12
2000Q3 21
2000Q4 30
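The original m2q dictionary can also be made to work: groupby expects a mapping from each label to its group name, i.e. the inverse of m2q. A sketch with a two-quarter subset (grouping on the transposed frame, since grouping by columns with axis=1 is deprecated in recent pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(2, 6),
                  columns=['2000-01', '2000-02', '2000-03',
                           '2000-04', '2000-05', '2000-06'])

m2q = {'2000q1': ['2000-01', '2000-02', '2000-03'],
       '2000q2': ['2000-04', '2000-05', '2000-06']}

# Invert to {'2000-01': '2000q1', ...} so each column maps to its quarter
col2q = {col: q for q, cols in m2q.items() for col in cols}

# Transpose, group the former columns by quarter, average, transpose back
quarterly = df.T.groupby(col2q).mean().T
print(quarterly)
```

This is why df.groupby(m2q) returned an empty grouping: none of the axis labels matched the dictionary's keys.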

column passed error while putting a list in a dataframe

I have this list :
20161216014500
20161216020000
20161216021500
20161216023000
20161216024500
20161216030000
20161216031500
20161216033000
20161216034500
20161216040000
20161216041500
20161216043000
20161216044500
20161216050000
20161216051500
20161216053000
20161216054500
And I want after parsing it and putting it in the correct format by this code:
for row in rows:
if "".join(row).strip() != "":
chaine = str(row[0]+row[1])
date = chaine[:10] + " " + chaine[11:]
header = parseDate(date)
header = str(header).replace('-','')
header = str(header).replace(':','')
header = str(header).replace(' ','')
print header
I want to insert the header(the list above) in a dataframe using pandas:
newDataframe = pd.DataFrame(data, index=index, columns=header)
This is the error I get:
14 columns passed, passed data had 1 columns
What is the reason for this error, and how can I correct it?
The error happens because columns=header passes 14 column labels while the data itself has only one column; the timestamps should go into the data, not the column labels. You can do the same thing this way:
import pandas as pd
rows = ['20161216014500',
'20161216020000',
'20161216021500',
'20161216023000',
'20161216024500',
'20161216030000',
'20161216031500',
'20161216033000',
'20161216034500',
'20161216040000',
'20161216041500',
'20161216043000',
'20161216044500',
'20161216050000',
'20161216051500',
'20161216053000',
'20161216054500']
df = pd.DataFrame(rows, columns=['date'])
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d%H%M%S')
df
output:
                  date
0  2016-12-16 01:45:00
1  2016-12-16 02:00:00
2  2016-12-16 02:15:00
3  2016-12-16 02:30:00
4  2016-12-16 02:45:00
5  2016-12-16 03:00:00
6  2016-12-16 03:15:00
7  2016-12-16 03:30:00
8  2016-12-16 03:45:00
9  2016-12-16 04:00:00
10 2016-12-16 04:15:00
11 2016-12-16 04:30:00
12 2016-12-16 04:45:00
13 2016-12-16 05:00:00
14 2016-12-16 05:15:00
15 2016-12-16 05:30:00
16 2016-12-16 05:45:00
import io
import pandas as pd
a = io.StringIO(u"""20161216014500
20161216020000
20161216021500
20161216023000
20161216024500
20161216030000
20161216031500
20161216033000
20161216034500
20161216040000
20161216041500
20161216043000
20161216044500
20161216050000
20161216051500
20161216053000
20161216054500""")
df = pd.read_csv(a, header=None)
df[0] = pd.to_datetime(df[0], format='%Y%m%d%H%M%S')
df.head()
Output:
0
0 2016-12-16 01:45:00
1 2016-12-16 02:00:00
2 2016-12-16 02:15:00
3 2016-12-16 02:30:00
4 2016-12-16 02:45:00
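As for the reason behind the original error: pd.DataFrame checks that the number of labels in columns matches the width of the data, so passing the timestamps as columns while data has a single column fails. A minimal sketch reproducing and fixing the mismatch (the three timestamps here are just illustrative):

```python
import pandas as pd

header = ['20161216014500', '20161216020000', '20161216021500']

# Passing 3 labels for 1-column data raises:
#   "3 columns passed, passed data had 1 columns"
try:
    pd.DataFrame([[1], [2], [3]], columns=header)
except Exception as e:  # the exact exception type varies across pandas versions
    print(e)

# The timestamps belong in the data, parsed into a single datetime column
df = pd.DataFrame({'date': pd.to_datetime(header, format='%Y%m%d%H%M%S')})
print(df)
```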

pandas datetimeindex between_time function(how to get a not_between_time)

I have a pandas df, and I use between_time with a and b to clean the data. How do I get the opposite, not-between-time behavior?
I know I can try something like
df.between_time('00:00:00', a)
df.between_time(b, '23:59:59')
and then combine and sort the new df, but it's very inefficient, and it doesn't work for me as I have data between 23:59:59 and 00:00:00.
Thanks
You could find the index locations of the rows with time between a and b, and then use df.index.difference to remove those from the index:
import pandas as pd
import io
text = '''\
date,time, val
20120105, 080000, 1
20120105, 080030, 2
20120105, 080100, 3
20120105, 080130, 4
20120105, 080200, 5
20120105, 235959.01, 6
'''
df = pd.read_csv(io.StringIO(text), parse_dates=[[0, 1]], index_col=0)
index = df.index
ivals = index.indexer_between_time('8:01:30','8:02')
print(df.reindex(index.difference(index[ivals])))
yields
val
date_time
2012-01-05 08:00:00 1
2012-01-05 08:00:30 2
2012-01-05 08:01:00 3
2012-01-05 23:59:59.010000 6
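In more recent pandas a boolean mask reads more directly (a sketch, not the original answer): take the rows between_time returns and keep their complement.

```python
import pandas as pd

idx = pd.to_datetime(['2012-01-05 08:00:00', '2012-01-05 08:01:30',
                      '2012-01-05 08:02:00', '2012-01-05 23:59:59'])
df = pd.DataFrame({'val': [1, 2, 3, 4]}, index=idx)

# Rows whose time falls between 08:01:30 and 08:02:00 (inclusive)...
inside = df.between_time('08:01:30', '08:02:00')

# ...and everything else
outside = df[~df.index.isin(inside.index)]
print(outside)
```

Because the mask is computed per row, this also behaves sensibly for data straddling midnight.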
