Matching two datasets with different daterange and different length - python

I have two csv files with different date formats and lengths.
First, I load these two files:
frameA = pd.read_csv("fileA.csv", dtype=str, delimiter=";", skiprows = None)
File A has 102216 rows x 3 columns, ends at 01.07.2012 00:00. Date and Time are in one column. Head looks like this:
Date Buy Sell
0 01.08.2009 00:15 0 0
1 01.08.2009 00:30 0 0
2 01.08.2009 00:45 0 0
3 01.08.2009 01:00 0 0
4 01.08.2009 01:15 0 0
.
frameB = pd.read_csv("fileB.csv", dtype=str, delimiter=";", skiprows = None)
File B has 92762 rows x 4 columns, ends at 22.07.2012 00:00. Date and Time are separate. Head looks like this:
Date Time Buy Sell
0 01.08.2009 01:00 0 0
1 01.08.2009 02:00 0 0
2 01.08.2009 03:00 0 0
3 01.08.2009 04:00 0 10
4 01.08.2009 05:00 0 32
How can I match these data sets like this:
Buy A Sell A Buy B Sell B
0 01.08.2009 00:15 0 0 0 0
1 01.08.2009 00:30 0 0 0 0
Both have to start and end with the same date, and the frequency has to be 15 min.
How can I get this? What should I do first?

OK, the first thing is to make sure both dfs have datetime dtypes. For the first df:
frameA = pd.read_csv("fileA.csv", dtype=str, delimiter=";", skiprows = None, parse_dates=['Date'])
and for the other df:
frameB = pd.read_csv("fileB.csv", dtype=str, delimiter=";", skiprows = None, parse_dates=[['Date','Time']])
Now I would reset the minute value of the first df like so:
In [149]:
df['Date'] = df['Date'].apply(lambda x: x.replace(minute=0))
df
Out[149]:
Date Buy Sell
index
0 2009-01-08 04:00:00 0 0
1 2009-01-08 04:00:00 0 0
2 2009-01-08 04:00:00 0 0
3 2009-01-08 05:00:00 0 0
4 2009-01-08 05:00:00 0 0
Now we can merge the dfs:
In [150]:
merged = df.merge(df1, left_on=['Date'], right_on=['Date_Time'], how='left',suffixes=[' A', ' B'])
merged
Out[150]:
Date Buy A Sell A Date_Time Buy B Sell B
0 2009-01-08 04:00:00 0 0 2009-01-08 04:00:00 0 10
1 2009-01-08 04:00:00 0 0 2009-01-08 04:00:00 0 10
2 2009-01-08 04:00:00 0 0 2009-01-08 04:00:00 0 10
3 2009-01-08 05:00:00 0 0 2009-01-08 05:00:00 0 32
4 2009-01-08 05:00:00 0 0 2009-01-08 05:00:00 0 32
Obviously replace df, df1 with frameA and frameB in your case
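The merge above collapses frame A to hourly stamps. If you need to keep the 15-minute granularity asked for in the question, a rough sketch going the other way (assuming the parse_dates calls above, so frame B's combined column is named Date_Time) is to upsample frame B instead and merge on the full timestamp:
frameB15 = (frameB.set_index('Date_Time')
                  .resample('15min')
                  .ffill()          # repeat each hourly value for its four quarter-hours
                  .reset_index())
merged = frameA.merge(frameB15, left_on='Date', right_on='Date_Time',
                      how='left', suffixes=[' A', ' B'])
The 15-minute slots before frame B's first record (and after its last one) will simply stay NaN.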

Another thing you could do is set the date to the index:
As the answer above correctly states, the first step is to parse them into an identical format.
frameA = pd.read_csv("fileA.csv", dtype=str, delimiter=";", skiprows = None, parse_dates=['Date'])
frameB = pd.read_csv("fileB.csv", dtype=str, delimiter=";", skiprows = None, parse_dates=[['Date','Time']])
After loading the data as shown above, we can set the date as the index to guide the merge:
frameA.index = frameA['Date']
frameB.index = frameB['Date_Time']  # parse_dates=[['Date','Time']] names the combined column 'Date_Time'
Then, they will merge on the exact same index, and since they have similar columns ('Buy', 'Sell'), we need to specify suffixes for the merger:
merge = frameA.join(frameB, lsuffix = ' A', rsuffix = ' B')
The result would look exactly like this.
Buy A Sell A Buy B Sell B
0 01.08.2009 00:15 0 0 0 0
1 01.08.2009 00:30 0 0 0 0
The advantage of this approach is that if your second data set ('Buy B', 'Sell B') is missing times that are present in the first one, the join will still work and you won't have data assigned to the wrong time. Say both frames had an arbitrary numerical index from 1-10000 and the second dataframe were missing 3 values (its index only going from 1-9997): that would cause a shift, and values would be assigned to the wrong rows if that numerical index were the one guiding the join.
Here, as long as the dataframe guiding the join is longer than the second dataframe, we won't lose any data, and we will never assign it to the wrong index.
So for example:
if len(frameA.index) >= len(frameB.index):
    merge = frameA.join(frameB, lsuffix=' A', rsuffix=' B')
else:
    print('Missing values, define your own function here')
    quit()
EDIT:
Another way to make sure all data is reported, regardless of whether it occurs in both frames, would be to define a new dataframe with a unique list of the dates present in both dataframes and use that to guide the merge.
For example,
unique_index = sorted(list(set(frameA.index.tolist() + frameB.index.tolist())))
This builds a unique index by concatenating both index lists, turning the result into a set, and back into a list. Sets remove duplicate values, so you get a unique list, and it is sorted explicitly since sets are unordered.
Then, you merge the dataframes:
merge = pd.DataFrame(index = unique_index)
merge = merge.join(frameA)
merge = merge.join(frameB, lsuffix = ' A', rsuffix = ' B')
Just make sure to export it with the index ON, or redefine the index as a column (exporting to a csv or an excel sheet automatically has the index on unless you turn it off, so just be sure not to set index = False).
And then any missing data from your 'Buy A', 'Sell A' columns that is present in 'Buy B', 'Sell B' will be 'nan', as will be data missing from 'Buy B', 'Sell B' that is present in 'Buy A', 'Sell A'.
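If the 15-minute requirement from the question matters, the same join pattern can also be driven by a regular 15-minute index instead of the union of the two indexes. A minimal sketch, assuming the 'Date'/'Date_Time' indexes set above and forward-filling the hourly B data onto the quarter-hours:
start = min(frameA.index.min(), frameB.index.min())
end = max(frameA.index.max(), frameB.index.max())
full_index = pd.date_range(start, end, freq='15min')
merge = pd.DataFrame(index=full_index)
merge = merge.join(frameA)
merge = merge.join(frameB, lsuffix=' A', rsuffix=' B')
merge[['Buy B', 'Sell B']] = merge[['Buy B', 'Sell B']].ffill()  # spread each hourly B value over its quarter-hours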

Related

How can I count values for each date in a Dataframe based conditionally on the value of a column?

I have a dataframe with xenophobic and non-xenophobic tweets.
For each day, I want to count the number of tweets that have a sentiment of 1.
This is the DataFrame df_unevaluated:
sentiment id date text
0 0 9.820000e+17 2018-04-05 11:43:31+00:00 but if she had stated another fact like that I may have thought...
1 0 1.170000e+18 2019-09-03 22:53:30+00:00 the worst thing that dude has done this week is ramble about the...
2 0 1.140000e+18 2019-06-28 17:43:07+00:00 i think immigrants of all walks of life should be allowed into...
3 0 2.810000e+17 2012-12-18 00:43:57+00:00 why is america not treating the immigrants like normal people...
4 1 8.310000e+17 2017-02-14 01:42:26+00:00 who the hell wants to live in canada anyhow the people there...
...
This is what I've tried:
# Put all tweets with sentiment = 1 into a DataFrame
for i in range(len(df_unevaluated)):
    if df_unevaluated['sentiment'][i] == 1:
        df_xenophobic = df_xenophobic.append(df_unevaluated.iloc[[i]])
# Store a copy of df_xenophobic in df_counts
df_counts = df_xenophobic
# Change df_counts to get counts for each date
df_counts = (pd.to_datetime(df_counts['date'])
.dt.floor('d')
.value_counts()
.rename_axis('date')
.reset_index(name='count'))
# Sort data and drop index column
df_counts = df_counts.sort_values('date')
df_counts = df_counts.reset_index(drop=True)
# Look at data
df_counts.head()
This was the output:
date count
0 2012-03-14 00:00:00+00:00 1
1 2012-03-19 00:00:00+00:00 1
2 2012-04-07 00:00:00+00:00 1
3 2012-04-10 00:00:00+00:00 1
4 2012-04-19 00:00:00+00:00 1
...
This is what I expected:
date count
0 2012-03-14 00:00:00+00:00 1
1 2012-03-15 00:00:00+00:00 0
2 2012-03-16 00:00:00+00:00 0
3 2012-03-17 00:00:00+00:00 0
4 2012-03-18 00:00:00+00:00 0
5 2012-03-19 00:00:00+00:00 1
6 2012-03-20 00:00:00+00:00 0
7 2012-03-21 00:00:00+00:00 0
...
These are some links I've read through:
Python & Pandas - Group by day and count for each day
Using value_counts in pandas with conditions
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.floor.html
To be more clear, each date has the format YYYY-MM-DD HH:MM:SS+00:00.
As seen in my attempt, I try to round the dates column to its day. My goal is to count the number of times sentiment = 1 for that day.
If I understood your question correctly, then it should be as simple as follows:
import pandas as pd
# Data Load
df = pd.DataFrame(data={'Date': ['2022-11-28 11:43:31+00:00', '2022-11-28 22:53:30+00:00', '2022-11-29 17:43:07+00:00', '2022-12-01 01:42:26+00:00', '2022-12-01 02:40:26+00:00'],
'Sentiment': [ 0, 1, 0, 1, 1]})
df['Date'] = pd.to_datetime(df['Date']).dt.date
df_counts = df.groupby(by=['Date']).sum().reset_index()
The df_counts data frame should then give output like this:
         Date  Sentiment
0  2022-11-28          1
1  2022-11-29          0
2  2022-12-01          2
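The expected output in the question also lists days with zero matching tweets; one way to get those, sketched here on top of the df_counts above, is to reindex against a full daily range:
df_counts['Date'] = pd.to_datetime(df_counts['Date'])   # back from date objects to Timestamps
full_days = pd.date_range(df_counts['Date'].min(), df_counts['Date'].max(), freq='D')
df_counts = (df_counts.set_index('Date')
                      .reindex(full_days, fill_value=0)  # days with no tweets get a count of 0
                      .rename_axis('Date')
                      .reset_index())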

How to convert one record per change to continuous data?

My data looks like this:
print(df)
DateTime, Status
'2021-09-01', 0
'2021-09-05', 1
'2021-09-07', 0
And I need it to look like this:
print(df_desired)
DateTime, Status
'2021-09-01', 0
'2021-09-02', 0
'2021-09-03', 0
'2021-09-04', 0
'2021-09-05', 1
'2021-09-06', 1
'2021-09-07', 0
Right now I accomplish this using pandas like this:
new_index = pd.DataFrame(index = pd.date_range(df.index[0], df.index[-1], freq='D'))
df = new_index.join(df).ffill()
Missing values before the first record in any column are imputed with the inverse of the first record in that column; because the data is binary and only records change-points, this is guaranteed to be correct.
To my understanding my desired dataframe contains "continuous" data, but I'm not sure what to call the data structure in my source data.
The problem:
When I do this to a dataframe that has one record per second and I want to load a year's worth of data, my memory overflows (92GB required, ~60GB available). I'm not sure whether there is a standard procedure for this that I don't know the name of and cannot find using Google, or whether I'm using the join method wrong, but it seems horribly inefficient: the resulting dataframe is only a few hundred megabytes after this operation. Any feedback on this would be great!
Use DataFrame.asfreq, which works with a DatetimeIndex:
df = df.set_index('DateTime').asfreq('d', method='ffill').reset_index()
print (df)
DateTime Status
0 2021-09-01 0
1 2021-09-02 0
2 2021-09-03 0
3 2021-09-04 0
4 2021-09-05 1
5 2021-09-06 1
6 2021-09-07 0
You can use this pipeline:
(df.set_index('DateTime')
   .reindex(pd.date_range(df['DateTime'].min(), df['DateTime'].max()))
   .rename_axis('DateTime')
   .ffill(downcast='infer')
   .reset_index()
)
output:
DateTime Status
0 2021-09-01 0
1 2021-09-02 0
2 2021-09-03 0
3 2021-09-04 0
4 2021-09-05 1
5 2021-09-06 1
6 2021-09-07 0
input:
DateTime Status
0 2021-09-01 0
1 2021-09-05 1
2 2021-09-07 0
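As a side note on the memory problem above: assuming Status really is just 0/1, downcasting it before expanding keeps the reindexed frame small even at one-second frequency. A rough sketch:
df['Status'] = df['Status'].astype('int8')   # 1 byte per value instead of 8
df = df.set_index('DateTime').asfreq('s', method='ffill').reset_index()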

Elegant way to shift multiple date columns - Pandas

I have a dataframe like as shown below
df = pd.DataFrame({'person_id': [11, 11, 11, 21, 21],
                   'offset': ['-131 days', '29 days', '142 days', '20 days', '-200 days'],
                   'date_1': ['05/29/2017', '01/21/1997', '7/27/1989', '01/01/2013', '12/31/2016'],
                   'dis_date': ['05/29/2017', '01/24/1999', '7/22/1999', '01/01/2015', '12/31/1991'],
                   'vis_date': ['05/29/2018', '01/27/1994', '7/29/2011', '01/01/2018', '12/31/2014']})
df['date_1'] = pd.to_datetime(df['date_1'])
df['dis_date'] = pd.to_datetime(df['dis_date'])
df['vis_date'] = pd.to_datetime(df['vis_date'])
I would like to shift all the dates of each subject based on their offset.
Though my code works (credit - SO), I am looking for a more elegant approach. You can see I am repeating almost the same line three times.
df['offset_to_shift'] = pd.to_timedelta(df['offset'],unit='d')
#am trying to make the below lines elegant/efficient
df['shifted_date_1'] = df['date_1'] + df['offset_to_shift']
df['shifted_dis_date'] = df['dis_date'] + df['offset_to_shift']
df['shifted_vis_date'] = df['vis_date'] + df['offset_to_shift']
I expect my output to be as shown below.
Use DataFrame.add along with DataFrame.add_prefix and DataFrame.join:
cols = ['date_1', 'dis_date', 'vis_date']
df = df.join(df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_'))
OR, it is also possible to use pd.concat:
df = pd.concat([df, df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_')], axis=1)
OR, we can also directly assign the new shifted columns to the dataframe:
df[['shifted_' + col for col in cols]] = df[cols].add(df['offset_to_shift'], axis=0)
Result:
# print(df)
person_id offset date_1 dis_date vis_date offset_to_shift shifted_date_1 shifted_dis_date shifted_vis_date
0 11 -131 days 2017-05-29 2017-05-29 2018-05-29 -131 days 2017-01-18 2017-01-18 2018-01-18
1 11 29 days 1997-01-21 1999-01-24 1994-01-27 29 days 1997-02-19 1999-02-22 1994-02-25
2 11 142 days 1989-07-27 1999-07-22 2011-07-29 142 days 1989-12-16 1999-12-11 2011-12-18
3 21 20 days 2013-01-01 2015-01-01 2018-01-01 20 days 2013-01-21 2015-01-21 2018-01-21
4 21 -200 days 2016-12-31 1991-12-31 2014-12-31 -200 days 2016-06-14 1991-06-14 2014-06-14

Python/Pandas - TypeError when concatenating MultiIndex DataFrames

I have trouble concatenating a list of MultiIndex DataFrames with 2 levels, and adding a third one to distinguish them.
As an example, I have following input data.
import pandas as pd
import numpy as np
# Input data
start = '2020-01-01 00:00+00:00'
end = '2020-01-01 02:00+00:00'
pr1h = pd.period_range(start=start, end=end, freq='1h')
midx1 = pd.MultiIndex.from_tuples([('Sup',1),('Sup',2),('Inf',1),('Inf',2)], names=['Data','Position'])
df1 = pd.DataFrame(np.random.rand(3,4), index=pr1h, columns=midx1)
df3 = pd.DataFrame(np.random.rand(3,4), index=pr1h, columns=midx1)
midx2 = pd.MultiIndex.from_tuples([('Sup',3),('Inf',3)], names=['Data','Position'])
df2 = pd.DataFrame(np.random.rand(3,2), index=pr1h, columns=midx2)
df4 = pd.DataFrame(np.random.rand(3,2), index=pr1h, columns=midx2)
So df1 & df2 hold data for the same tag, 1h, and while they share the same column names at the Data level, they don't have the same column names at the Position level.
df1
Data Sup Inf
Position 1 2 1 2
2020-01-01 00:00 0.660795 0.538452 0.861801 0.502479
2020-01-01 01:00 0.205806 0.847124 0.474861 0.906546
2020-01-01 02:00 0.681480 0.479512 0.631771 0.961844
df2
Data Sup Inf
Position 3 3
2020-01-01 00:00 0.758533 0.672899
2020-01-01 01:00 0.096463 0.304843
2020-01-01 02:00 0.080504 0.990310
Now, df3 and df4 follow the same logic and same column names. To distinguish them from df1 & df2, I want to use a different tag, 2h for instance.
I want to add this third level, named Period, during the call to pd.concat. For this, I am trying to use the keys parameter of pd.concat(). I tried the following code.
df_list = [df1, df2, df3, df4]
period_list = ['1h', '1h', '2h', '2h']
concatenated = pd.concat(df_list, keys=period_list, names=('Period', 'Data', 'Position'), axis=1)
But this raises the following error:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'slice'
Please, any idea what the correct call for this is?
Thank you for your help.
EDIT 05/05
As requested, here is the desired result (copied directly from the answer given; the result obtained from that answer is the one I am looking for).
Period 1h \
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.309778 0.597582 0.872392 0.983021 0.659965 0.214953
2020-01-01 01:00 0.467403 0.875744 0.296069 0.131291 0.203047 0.382865
2020-01-01 02:00 0.842818 0.659036 0.595440 0.436354 0.224873 0.114649
Period 2h
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.356250 0.587131 0.149471 0.171239 0.583017 0.232641
2020-01-01 01:00 0.397165 0.637952 0.372520 0.002407 0.556518 0.523811
2020-01-01 02:00 0.548816 0.126972 0.079793 0.235039 0.350958 0.705332
A quick fix would be to use different names in period_list and rename just after the concat. Something like:
df_list = [df1, df2, df3, df4]
period_list = ['1h_a', '1h_b', '2h_a', '2h_b']
concatenated = pd.concat(df_list,
                         keys=period_list,
                         names=('Period', 'Data', 'Position'),
                         axis=1)\
              .rename(columns={col: col.split('_')[0] for col in period_list},
                      level='Period')
print (concatenated)
Period 1h \
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.309778 0.597582 0.872392 0.983021 0.659965 0.214953
2020-01-01 01:00 0.467403 0.875744 0.296069 0.131291 0.203047 0.382865
2020-01-01 02:00 0.842818 0.659036 0.595440 0.436354 0.224873 0.114649
Period 2h
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.356250 0.587131 0.149471 0.171239 0.583017 0.232641
2020-01-01 01:00 0.397165 0.637952 0.372520 0.002407 0.556518 0.523811
2020-01-01 02:00 0.548816 0.126972 0.079793 0.235039 0.350958 0.705332
Edit: as speed is a concern, it seems that rename is slow, so you can do:
concatenated = pd.concat(df_list,
                         keys=period_list,
                         axis=1)
concatenated.columns = pd.MultiIndex.from_tuples([(col[0].split('_')[0], col[1], col[2])
                                                  for col in concatenated.columns],
                                                 names=('Period', 'Data', 'Position'))
Consider an inner concat on similar data frames, then run a final concat to bind all together:
concatenated = pd.concat([pd.concat([df1, df2], axis=1),
                          pd.concat([df3, df4], axis=1)],
                         keys=['1h', '2h'],
                         names=('Period', 'Data', 'Position'),
                         axis=1)
print(concatenated)
Period 1h \
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.189802 0.675083 0.624484 0.781774 0.453101 0.224525
2020-01-01 01:00 0.249818 0.829180 0.190488 0.923107 0.495873 0.278201
2020-01-01 02:00 0.602634 0.494915 0.612672 0.903609 0.426809 0.248981
Period 2h
Data Sup Inf Sup Inf
Position 1 2 1 2 3 3
2020-01-01 00:00 0.746499 0.385714 0.008561 0.961152 0.988231 0.897454
2020-01-01 01:00 0.643730 0.365023 0.812249 0.291733 0.045417 0.414968
2020-01-01 02:00 0.887567 0.680102 0.978388 0.018501 0.695866 0.679730
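If the list of frames is longer or built dynamically, the same idea can be generalized by grouping the frames by their period label first and concatenating once per group. A sketch, assuming df_list and period_list as defined in the question:
from collections import defaultdict
groups = defaultdict(list)
for frame, period in zip(df_list, period_list):
    groups[period].append(frame)
# inner concat per period, then one outer concat keyed by the (now unique) period labels
concatenated = pd.concat({period: pd.concat(frames, axis=1) for period, frames in groups.items()},
                         names=('Period', 'Data', 'Position'),
                         axis=1)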

How can I add parts of a column to a new pandas data frame?

So I have a pandas data frame of length 90, which isn't important.
Let's say I have:
df
A date
1 2012-01-01
4 2012-02-01
5 2012-03-01
7 2012-04-01
8 2012-05-01
9 2012-06-01
2 2012-07-01
1 2012-08-01
3 2012-09-01
2 2012-10-01
5 2012-11-01
9 2012-12-01
0 2013-01-01
6 2013-02-01
and I have created a new data frame
df_copy=df.copy()
index = range(0,3)
df1 = pd.DataFrame(index=index, columns=range((len(df_copy.columns))))
df1.columns = df_copy.columns
df1['date'] = pd.date_range('2019-11-01','2020-01-01' , freq='MS')-pd.offsets.MonthBegin(1)
which should create a data frame like this
A date
na 2019-10-01
na 2019-11-01
na 2019-12-01
So I use the following code to get the values of A in my new data frame
df1['A'] = df1['A'].iloc[9:12]
And I want the outcome to be this
A date
2 2019-10-01
5 2019-11-01
9 2019-12-01
so I want the last 3 values to be assigned the values at iloc positions 9-12 of the original data frame; the indexes are different and so are the dates in both data frames. Is there a way to do this? Because
df1['A'] = df1['A'].iloc[9:12]
doesn't seem to work
According to my knowledge you can solve this by generating several new data frames:
df_copy=df.copy()
index = range(0,1)
df1 = pd.DataFrame(index=index, columns=range((len(df_copy.columns))))
df1.columns = df_copy.columns
df1['date'] = pd.date_range('2019-11-01','2019-11-01' , freq='MS')-pd.offsets.MonthBegin(1)
df1['A'] = df_copy['A'].iloc[9]  # take the value from the original frame, not from the empty df1
Then appending to your original data frame and repeating it is a bit overwhelming, but it seems like the only solution I could come up with.
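For reference, a minimal sketch of the loop described above might look like this (assuming df from the question; pd.concat is used instead of the deprecated append):
import pandas as pd
pieces = []
dates = pd.date_range('2019-10-01', '2019-12-01', freq='MS')   # 2019-10-01, 2019-11-01, 2019-12-01
for pos, date in zip(range(9, 12), dates):                     # iloc positions 9, 10, 11 of df
    pieces.append(pd.DataFrame({'A': [df['A'].iloc[pos]], 'date': [date]}))
df1 = pd.concat(pieces, ignore_index=True)
A shorter alternative, if positional copying is all that is needed, is df1['A'] = df['A'].iloc[9:12].to_numpy(), which bypasses index alignment entirely.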
