How do I merge two tables with dates within the key (Python)

I wandered around a lot before I could find a solution to my issue, and I wanted to ask the community whether you have a better idea than the one I came up with.
My problem is the following:
I have two tables (one table is my source data and the other is the mapping) that I want to merge on a certain key.
In my source data, I have two dates: Date_1 and Date_2
In my mapping, I have four dates: Date_1_begin, Date_1_end, Date_2_begin, Date_2_end
The problem is: those dates are part of my key.
For example:
df
   A  B      date
0  1  A  20210310
1  1  A  20190101
2  3  C  19981231
mapping
   A  B  date_begin  date_end  code
0  1  A    19600101  20201231     1
1  1  A    20210101  20991231     2
2  3  C    19600101  20991231     3
The idea is that: doing something like this:
pd.merge(df, mapping, on = ['A','B'])
would give me two codes for key 1_A : 1 and 2. But I want a 1-1 relation.
In order to assign the right code considering the dates, I did something like this using piecewise from the numpy library:
import numpy as np

df_date = df['date'].values
conds = [(df_date >= start_date) & (df_date <= end_date)
         for start_date, end_date in zip(mapping.date_begin.values, mapping.date_end.values)]
result = np.piecewise(np.zeros(len(df)), conds, mapping['code'].values)
df['code'] = result
And it works fine... But I figured something easier and classier must exist somewhere...
Many thanks in advance!
Clem

You need to add an enumeration to the duplicate rows:
(df.assign(enum=df.groupby(['A','B']).cumcount())
   .merge(mapping.assign(enum=mapping.groupby(['A','B']).cumcount()),
          on=['A','B','enum'])
)
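A merge-then-filter approach often reads more naturally here: merge on the non-date keys, then keep only the rows whose date falls inside the mapping interval. A sketch using the example frames from the question (this assumes every date falls into exactly one interval per key, as in the example):

```python
import pandas as pd

# Example frames from the question (dates kept as ints so they compare correctly)
df = pd.DataFrame({'A': [1, 1, 3], 'B': ['A', 'A', 'C'],
                   'date': [20210310, 20190101, 19981231]})
mapping = pd.DataFrame({'A': [1, 1, 3], 'B': ['A', 'A', 'C'],
                        'date_begin': [19600101, 20210101, 19600101],
                        'date_end': [20201231, 20991231, 20991231],
                        'code': [1, 2, 3]})

# Merge on the non-date keys, then keep the rows whose date lies in the
# [date_begin, date_end] interval, which restores the 1-1 relation.
merged = df.merge(mapping, on=['A', 'B'])
result = merged[(merged['date'] >= merged['date_begin'])
                & (merged['date'] <= merged['date_end'])]
result = result.drop(columns=['date_begin', 'date_end'])
```

Each source row keeps exactly one code: 2, 1, and 3 for the three rows above.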

Related

filtering pandas dataframe when data contains two parts

I have a pandas dataframe and want to filter down to all the rows that contain a certain criteria in the “Title” column.
The rows I want to filter down to are all rows that contain the format “(Axx)” (Where xx are 2 numbers).
The data in the “Title” column doesn’t just consist of “(Axx)” data.
The data in the “Title” column looks like so:
“some_string (Axx)”
I've been playing around a bit with different methods but can't seem to get it to work.
I think the closest I've gotten is:
df.filter(regex=r'(D\d{2})', axis=0)
but it's not correct, as the entries aren't being filtered.
Use Series.str.contains with escaped ( and ), $ for end of string, and filter with boolean indexing:
df = pd.DataFrame({'Title':['(D89)','aaa (D71)','(D5)','(D78) aa','D72']})
print (df)
       Title
0      (D89)
1  aaa (D71)
2       (D5)
3   (D78) aa
4        D72
df1 = df[df['Title'].str.contains(r'\(D\d{2}\)$')]
print (df1)
       Title
0      (D89)
1  aaa (D71)
If you need to match only (Dxx), use Series.str.match:
df2 = df[df['Title'].str.match(r'\(D\d{2}\)$')]
print (df2)
Title
0 (D89)

How can I compare each row from a dataframe against every row from another dataframe and see the difference between values?

I have two dataframes:
df1
Code Number
0 ABC123 1
1 DEF456 2
2 GHI789 3
3 DEA456 4
df2
Code
0 ABD123
1 DEA458
2 GHI789
df1 acts like a dictionary, from which I can get the respective number for each item by checking its code. There are, however, unregistered codes, and in case I find an unregistered code, I'm supposed to look for the codes that look the most like it. So, the outcome should be:
ABD123 = 1 (because it has 1 different character from ABC123)
DEA458 = 4 (because it has 1 different character from DEA456, and 2 from DEF456, so it chooses the closest one)
GHI789 = 3 (because it has an equivalent in df1)
I know how to check for the differences of each code individually and save the "length" of characters that differ, but I don't know how to apply this code as I don't know how to compare each row from df2 against all rows from df1. Is there a way?
"I don't know how to compare each row from df2 against all rows from df1."
Nested loops will work. If you had a function named compare it would look like this...
for index2, row2 in df2.iterrows():
    for index1, row1 in df1.iterrows():
        difference = compare(row2, row1)
        # do something with the difference
Nested loops are usually not ideal when working with Pandas or Numpy but they do work. There may be better solutions.
DataFrame.iterrows()
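A minimal, runnable version of this idea, assuming a simple character-difference compare (the closest-match choice via idxmin is one possible design, not the only one):

```python
import pandas as pd

# Example frames from the question
df1 = pd.DataFrame({'Code': ['ABC123', 'DEF456', 'GHI789', 'DEA456'],
                    'Number': [1, 2, 3, 4]})
df2 = pd.DataFrame({'Code': ['ABD123', 'DEA458', 'GHI789']})

def compare(a, b):
    # Count differing characters position by position (Hamming-style,
    # since all codes here have the same length)
    return sum(c1 != c2 for c1, c2 in zip(a, b))

# For each code in df2, pick the df1 row whose code differs in the
# fewest characters, and take its Number
df2['Number'] = [
    df1.loc[df1['Code'].map(lambda c: compare(code, c)).idxmin(), 'Number']
    for code in df2['Code']
]
```

For the example data this assigns 1, 4, and 3, matching the expected outcome in the question.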
This should work too:
df1['Code_e'] = df1['Code'].str.extract(r'(\d+)').astype(int)
df2['Code_e'] = df2['Code'].str.extract(r'(\d+)').astype(int)
final_df = pd.merge_asof(df2.sort_values(by='Code_e'), df1.sort_values(by='Code_e'),
                         on='Code_e', suffixes=('', '_right')).drop(['Code_e', 'Code_right'], axis=1)

Pandas: Nesting Dataframes

Hello I want to store a dataframe in another dataframe cell.
I have data that looks like this:
I have daily data which consists of date, steps, and calories. In addition, I have minute by minute HR data of a specific date. Obviously it would be easy to put the minute by minute data in 2 dimensional list but I'm fearing that would be harder to analyze later.
What would be the best practice when I want to have both data in one dataframe? Is it even possible to even nest dataframes?
Any better ideas? Thanks!
Yes, it seems possible to nest dataframes but I would recommend instead rethinking how you want to structure your data, which depends on your application or the analyses you want to run on it after.
How to "nest" dataframes into another dataframe
Your dataframe containing your nested "sub-dataframes" won't be displayed very nicely. However, just to show that it is possible to nest your dataframes, take a look at this mini-example:
Here we have 3 random dataframes:
>>> df1
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
>>> df2
0 1 2
0 0.090917 0.457668 0.598548
1 0.748639 0.729935 0.680409
2 0.301244 0.024004 0.361283
>>> df3
0 1 2
0 0.200375 0.059798 0.665323
1 0.086708 0.320635 0.594862
2 0.299289 0.014134 0.085295
We can make a main dataframe that includes these dataframes as values in individual "cells":
df = pd.DataFrame({'idx':[1,2,3], 'dfs':[df1, df2, df3]})
We can then access these nested dataframes as we would access any value in any other dataframe:
>>> df['dfs'].iloc[0]
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
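If you do rethink the structure, one common alternative is a single long-format table where each minute-level row carries its day's summary columns. A sketch with hypothetical column names (date, steps, calories, minute, hr are assumptions standing in for your real data):

```python
import pandas as pd

# Hypothetical daily summary and minute-by-minute heart-rate data
daily = pd.DataFrame({'date': ['2023-01-01', '2023-01-02'],
                      'steps': [8000, 10500],
                      'calories': [2100, 2400]})
hr = pd.DataFrame({'date': ['2023-01-01', '2023-01-01'],
                   'minute': ['00:00', '00:01'],
                   'hr': [62, 64]})

# One flat table: every HR sample carries its day's summary columns,
# which keeps ordinary groupby/rolling analysis straightforward
long = hr.merge(daily, on='date', how='left')
```

This stays easy to analyze with the usual pandas tools, which is the main thing nested dataframes give up.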

Merges and joins in pandas

I am joining two DataFrame tables that show sums of elements from two different months.
Here is df1:
Query ValueA0
0 IO1_DerivativeReceivables_ChathamLocal 673437.850000
1 IO2_CollateralCalledforReceipt 60000.000000
2 OO1_DerivativePayables_ChathamLocal 73537.550000
Here is df2:
Query ValueB0
0 IO1_DerivativeReceivables_ChathamLocal 336705.200000
1 IO2_CollateralCalledforReceipt 20920.000000
2 OO1_DerivativePayables_ChathamLocal 11299.130000
Note that the queries are the same, but the values are different.
I tried to join them with the following code:
import pandas as pd
pd.merge(df1, df2, on='Query')
This was my result:
Query ValueA0 \
0 IO1_DerivativeReceivables_ChathamLocal 673437.850000
1 IO2_CollateralCalledforReceipt 60000.000000
2 OO1_DerivativePayables_ChathamLocal 73537.550000
ValueB0
0 336705.200000
1 20920.000000
2 11299.130000
This is what I was expecting:
Query ValueA0 ValueB0
0 IO1_DerivativeReceivables_ChathamLocal 673437.850000 336705.200000
1 IO2_CollateralCalledforReceipt 60000.000000 20920.000000
2 OO1_DerivativePayables_ChathamLocal 73537.550000 11299.130000
How do I do this? The join seems fairly simple. I have tried several variations of joins and always end up with the tables appearing as though they are separated. Is this correct?
See this: I had a dataframe that displayed the same way, but the dataframe was one intact data frame...
The join is correct; the trailing backslash just means pandas wrapped the wide frame across multiple lines for display.
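A quick way to convince yourself the merge produced one intact frame is to inspect its shape and columns, or widen the console display so the wrapping disappears (a sketch with the question's data, query names shortened):

```python
import pandas as pd

df1 = pd.DataFrame({'Query': ['IO1', 'IO2', 'OO1'],
                    'ValueA0': [673437.85, 60000.00, 73537.55]})
df2 = pd.DataFrame({'Query': ['IO1', 'IO2', 'OO1'],
                    'ValueB0': [336705.20, 20920.00, 11299.13]})

merged = pd.merge(df1, df2, on='Query')

# One frame with all three columns; widening the display width
# stops pandas from wrapping wide frames with a trailing backslash
pd.set_option('display.width', 200)
print(merged)
```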

How to apply a function to each column of a pivot table in pandas?

Code:
df = pd.read_csv("example.csv", parse_dates=['ds'])
df2 = df.set_index(['ds', 'city']).unstack('city')
rm = df2.rolling(3).mean()
sd = df2.rolling(3).std()
df2 output:
What I want: I want to be able to see whether for each city, for each date, if the number is greater than 1 std dev away from the mean of bookings for that city. For ex pseudocode:
for each (city column):
    for each (date):
        see whether (number of bookings) - (same date and city rolling mean) > (same date and city std dev)
        print that date and city and number of bookings
What the problem is: I'm having trouble trying to figure out how to access the data I need from each of the data frames to do so. The parts of the pseudocode in parenthesis is what I need help figuring out.
What I tried:
df2['city']
list(df2)
Both give me errors.
df2[1:2]
Slicing works, but I feel like that's not the best way to access it.
You should use the apply function of the DataFrame API. A demo is below:
import pandas as pd
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df['C'] = df.apply(lambda row: row['A']*row['B'], axis=1)
Output:
>>> df
A B C
0 1 1 1
1 2 2 4
2 3 3 9
3 4 4 16
4 5 5 25
More concretely for your case:
You have to precompute the "same date and city rolling mean" and "same date and city std dev". You can use the groupby function for this: it lets you aggregate data by city and date, after which you can calculate the std dev and mean.
Put the std dev and mean in your table; use a dictionary for it: some_dict = {('city', 'date'): [std_dev, mean], ...}. To put the data into the dataframe, use the apply function.
You then have all the data necessary to run your check with the apply function.
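Since the unstacked frame already has one column per city, a simpler variant of the steps above works directly on it, with no dictionary needed. A sketch with hypothetical data standing in for df2, implementing the 1-std-dev check from the pseudocode:

```python
import pandas as pd
import numpy as np

# Hypothetical bookings data: rows are dates, columns are cities
# (stands in for the unstacked df2 from the question)
rng = np.random.default_rng(0)
dates = pd.date_range('2023-01-01', periods=10)
df2 = pd.DataFrame(rng.integers(50, 150, size=(10, 2)),
                   index=dates, columns=['NYC', 'SF'])

# Rolling statistics per city (column-wise over a 3-day window)
rm = df2.rolling(3).mean()
sd = df2.rolling(3).std()

# Boolean mask: bookings more than 1 std dev above the rolling mean
flagged = (df2 - rm) > sd

# Print the flagged (city, date, bookings) triples
for city in df2.columns:
    for date in df2.index[flagged[city]]:
        print(city, date.date(), df2.loc[date, city])
```

The comparisons against the NaN values in the first two rolling rows come out False, so those dates are never flagged.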
