Merges and joins in pandas - python

I am joining two DataFrame tables that show sums of elements from two different months.
Here is df1:
Query ValueA0
0 IO1_DerivativeReceivables_ChathamLocal 673437.850000
1 IO2_CollateralCalledforReceipt 60000.000000
2 OO1_DerivativePayables_ChathamLocal 73537.550000
Here is df2:
Query ValueB0
0 IO1_DerivativeReceivables_ChathamLocal 336705.200000
1 IO2_CollateralCalledforReceipt 20920.000000
2 OO1_DerivativePayables_ChathamLocal 11299.130000
Note that the queries are the same, but the values are different.
I tried to join them with the following code:
import pandas as pd
pd.merge(df1, df2, on='Query')
This was my result:
Query ValueA0 \
0 IO1_DerivativeReceivables_ChathamLocal 673437.850000
1 IO2_CollateralCalledforReceipt 60000.000000
2 OO1_DerivativePayables_ChathamLocal 73537.550000
ValueB0
0 336705.200000
1 20920.000000
2 11299.130000
This is what I was expecting:
Query ValueA0 ValueB0
0 IO1_DerivativeReceivables_ChathamLocal 673437.850000 336705.200000
1 IO2_CollateralCalledforReceipt 60000.000000 20920.000000
2 OO1_DerivativePayables_ChathamLocal 73537.550000 11299.130000
How do I do this? The join seems fairly simple. I have tried several variations of joins and always end up with the tables appearing as though they are separated. Is this correct?

See this: I had a dataframe and it displayed like this too, but the dataframe was one intact data frame...

The join is correct; the backslash in the printed result just means pandas wrapped the wide frame onto a second block of lines for display. The result is a single DataFrame with all three columns, so no further information is needed.
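A quick way to confirm this (just a sketch, assuming df1 and df2 from the question are already defined):

import pandas as pd

merged = pd.merge(df1, df2, on='Query')

# All three columns live in the single merged frame;
# the backslash in the printout is only console line-wrapping.
print(merged.columns.tolist())   # ['Query', 'ValueA0', 'ValueB0']

# Widening the display (or using to_string) shows everything on one line.
pd.set_option('display.width', 200)
print(merged.to_string())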

Related

filtering pandas dataframe when data contains two parts

I have a pandas dataframe and want to filter down to all the rows that match certain criteria in the "Title" column.
The rows I want to filter down to are all rows that contain the format "(Dxx)" (where xx are two digits).
The data in the "Title" column doesn't just consist of "(Dxx)" entries.
The data in the "Title" column looks like so:
"some_string (Dxx)"
I've been playing around a bit with different methods but can't seem to get it.
I think the closest I've gotten is:
df.filter(regex=r'(D\d{2})', axis=0)
but it's not correct, as the entries aren't being filtered.
Use Series.str.contains with escaped parentheses and $ to anchor the end of the string, then filter with boolean indexing:
df = pd.DataFrame({'Title':['(D89)','aaa (D71)','(D5)','(D78) aa','D72']})
print (df)
       Title
0      (D89)
1  aaa (D71)
2       (D5)
3   (D78) aa
4        D72

df1 = df[df['Title'].str.contains(r'\(D\d{2}\)$')]
print (df1)
       Title
0      (D89)
1  aaa (D71)
If you need to match only strings that are exactly (Dxx), use Series.str.match:
df2 = df[df['Title'].str.match(r'\(D\d{2}\)$')]
print (df2)
       Title
0      (D89)

How can I compare each row from a dataframe against every row from another dataframe and see the difference between values?

I have two dataframes:
df1
Code Number
0 ABC123 1
1 DEF456 2
2 GHI789 3
3 DEA456 4
df2
Code
0 ABD123
1 DEA458
2 GHI789
df1 acts like a dictionary, from which I can get the respective number for each item by checking their code. There are, however, unregistered codes, and in case I find an unregistered code, I'm supposed to look for the codes that look the most like them. So, the outcome should to be:
ABD123 = 1 (because it has 1 different character from ABC123)
DEA458 = 4 (because it has 1 different character from DEA456, and 2 from DEF456, so it chooses the closest one)
GHI789 = 3 (because it has an equivalent at df1)
I know how to check for the differences of each code individually and save the "length" of characters that differ, but I don't know how to apply this code as I don't know how to compare each row from df2 against all rows from df1. Is there a way?
"...don't know how to compare each row from df2 against all rows from df1."
Nested loops will work. If you had a function named compare, it would look like this (a possible compare is sketched a little further below)...
for index2, row2 in df2.iterrows():
    for index1, row1 in df1.iterrows():
        difference = compare(row2, row1)
        # do something with the difference
Nested loops are usually not ideal when working with Pandas or Numpy but they do work. There may be better solutions.
DataFrame.iterrows()
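The answer leaves compare as a placeholder; one possible implementation (just a sketch using a simple character-difference count, with the df1 and df2 frames from the question) could be:

def compare(row2, row1):
    # Count the positions at which the two codes differ
    # (a Hamming-style distance; the codes here all have the same length).
    return sum(c1 != c2 for c1, c2 in zip(row2['Code'], row1['Code']))

# For each code in df2, take the Number of the closest code in df1.
matches = {}
for _, row2 in df2.iterrows():
    _, closest = min(df1.iterrows(), key=lambda pair: compare(row2, pair[1]))
    matches[row2['Code']] = closest['Number']
print(matches)   # -> {'ABD123': 1, 'DEA458': 4, 'GHI789': 3} for the example data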
This should work too:
df1['Code_e'] = df1['Code'].str.extract(r'(\d+)', expand=False).astype(int)
df2['Code_e'] = df2['Code'].str.extract(r'(\d+)', expand=False).astype(int)
final_df = (pd.merge_asof(df2, df1.sort_values(by='Code_e'), on='Code_e', suffixes=('', '_right'))
              .drop(['Code_e', 'Code_right'], axis=1))

How do I merge two tables with dates within the key (Python)

I searched around a lot before I could find a solution to my issue, and I wanted to ask the community whether you have a better idea than the one I came up with.
My problem is the following:
I have two tables (one table is my source data and the other is the mapping) that I want to merge on a certain key.
In my source data, I have two dates: Date_1 and Date_2
In my mapping, I have four dates: Date_1_begin, Date_1_end, Date_2_begin, Date_2_end
The problem is: those dates are part of my key.
For example:
df
A B date
0 1 A 20210310
1 1 A 20190101
2 3 C 19981231
mapping
A B date_begin date_end code
0 1 A 19600101 20201231 1
1 1 A 20210101 20991231 2
2 3 C 19600101 20991231 3
The idea is that doing something like this:
pd.merge(df, mapping, on = ['A','B'])
would give me two codes for key 1_A: 1 and 2. But I want a 1-to-1 relation.
In order to assign the right code considering the dates, I did something like this using np.piecewise from the numpy library:
import numpy as np

df_date = df['date'].values
conds = [(df_date >= start_date) & (df_date <= end_date)
         for start_date, end_date in zip(mapping.date_begin.values, mapping.date_end.values)]
result = np.piecewise(np.zeros(len(df)), conds, mapping['code'].values)
df['code'] = result
And it works fine... But I figured there must be something easier and more elegant out there...
Many thanks in advance!
Clem
You need to add an enumeration of the duplicated (A, B) rows to both frames and merge on it as well:
(df.assign(enum=df.groupby(['A', 'B']).cumcount())
   .merge(mapping.assign(enum=mapping.groupby(['A', 'B']).cumcount()),
          on=['A', 'B', 'enum'])
)
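For the date condition itself, a common alternative (just a sketch, using the df and mapping frames from the question and assuming the date columns all have comparable types, e.g. integers like 20210310) is to merge on the non-date keys first and then keep only the rows whose date falls inside the mapped interval:

import pandas as pd

merged = pd.merge(df, mapping, on=['A', 'B'])

# Keep only the rows whose date lies between date_begin and date_end,
# which leaves a single code per source row.
result = merged[(merged['date'] >= merged['date_begin']) &
                (merged['date'] <= merged['date_end'])]
result = result.drop(columns=['date_begin', 'date_end'])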

Splitting column of a really big dataframe in two (or more) new cols

Problem
Hey there! I'm having some trouble trying to split one column of my dataframe into two (or even more) new columns. I think the trouble comes from the fact that the dataframe I'm working with comes from a really big CSV file, almost 10 GB in size. Once it is loaded into a pandas dataframe, it holds ~60 million rows and 5 columns.
Example
Initially, the dataframes looks something like this:
In [1]: df
Out[1]:
category other_col
0 animal.cat 5
1 animal.dog 3
2 clothes.shirt.sports 6
3 shoes.laces 1
4 None 0
I want to first remove the rows of the df for which the category is not defined (i.e., the last one), and then split the category column into three new columns based on where the dots appear: one for the main category, one for the first subcategory and another one for the last subcategory (if that actually exists). Finally, I want to merge the whole dataframe back together.
In other words, this is what I want to obtain:
In [2]: df_after
Out[2]:
other_col main_cat sub_category_1 sub_category_2
0 5 animal cat None
1 3 animal dog None
2 6 clothes shirt sports
3 1 shoes laces None
My approach
My approach for this was the following:
df = df[df['category'].notnull()]
df_wt_cat = df.drop(columns=['category'])
df_cat_subcat = df['category'].str.split('.', expand=True).rename(columns={0: 'main_cat', 1: 'sub_category_1', 2: 'sub_category_2', 3: 'sub_category_3'})
df_after = pd.concat([df_wt_cat, df_cat_subcat], axis=1)
which seems to work just fine with small datasets, but it eats up too much memory when applied to a dataframe that big, and the Jupyter kernel just dies.
I've tried to read the dataframe in chunks, but I'm not really sure how I should proceed after that; I've obviously tried searching for this kind of problem here on Stack Overflow, but I didn't manage to find anything useful.
Any help is appreciated!
The split and join methods do the job:
results = df['category'].str.split(".", expand=True)
df_after = df.join(results)
After doing that you can freely filter the resulting dataframe.
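If memory is the real bottleneck, the same transformation can be applied chunk by chunk while reading the CSV (just a sketch; 'data.csv', the chunk size, and the assumption of at most three category levels are placeholders based on the example above):

import pandas as pd

pieces = []
# Read the file in manageable pieces instead of loading all ~60M rows at once.
for chunk in pd.read_csv('data.csv', chunksize=1_000_000):
    chunk = chunk[chunk['category'].notnull()]
    cats = chunk['category'].str.split('.', expand=True)
    # Name however many split columns this chunk produced (up to three levels).
    cats.columns = ['main_cat', 'sub_category_1', 'sub_category_2'][:cats.shape[1]]
    pieces.append(chunk.drop(columns=['category']).join(cats))

df_after = pd.concat(pieces, ignore_index=True)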

Pandas: Nesting Dataframes

Hello, I want to store a dataframe in another dataframe's cell.
I have data that looks like this: daily data which consists of date, steps, and calories. In addition, I have minute-by-minute HR data for a specific date. Obviously it would be easy to put the minute-by-minute data in a 2-dimensional list, but I fear that would be harder to analyze later.
What would be the best practice when I want to have both data sets in one dataframe? Is it even possible to nest dataframes?
Any better ideas? Thanks!
Yes, it seems possible to nest dataframes, but I would recommend instead rethinking how you want to structure your data, which depends on your application or the analyses you want to run on it afterwards.
How to "nest" dataframes into another dataframe
Your dataframe containing your nested "sub-dataframes" won't be displayed very nicely. However, just to show that it is possible to nest your dataframes, take a look at this mini-example:
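(For completeness, the three frames below can be built with something like the following; the values are random, so yours will differ.)

import numpy as np
import pandas as pd

# Three small frames of random values, 3 rows x 3 columns each.
df1, df2, df3 = (pd.DataFrame(np.random.rand(3, 3)) for _ in range(3))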
Here we have 3 random dataframes:
>>> df1
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
>>> df2
0 1 2
0 0.090917 0.457668 0.598548
1 0.748639 0.729935 0.680409
2 0.301244 0.024004 0.361283
>>> df3
0 1 2
0 0.200375 0.059798 0.665323
1 0.086708 0.320635 0.594862
2 0.299289 0.014134 0.085295
We can make a main dataframe that includes these dataframes as values in individual "cells":
df = pd.DataFrame({'idx':[1,2,3], 'dfs':[df1, df2, df3]})
We can then access these nested dataframes as we would access any value in any other dataframe:
>>> df['dfs'].iloc[0]
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
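If nesting turns out to be awkward for your analysis, one common way to restructure data like that in the question (just a sketch with made-up column names) is to keep the minute-level HR readings in their own long-format frame keyed by date, and join or aggregate them onto the daily frame when needed:

import pandas as pd

# Daily summary: one row per day (column names are made up for illustration).
daily = pd.DataFrame({
    'date': pd.to_datetime(['2021-03-01', '2021-03-02']),
    'steps': [8000, 9500],
    'calories': [2100, 2300],
})

# Minute-by-minute HR readings in their own long-format frame, keyed by date.
hr = pd.DataFrame({
    'date': pd.to_datetime(['2021-03-01'] * 3),
    'minute': ['00:00', '00:01', '00:02'],
    'hr': [62, 64, 63],
})

# Aggregate per day and attach to the daily frame when needed.
daily_with_hr = daily.merge(
    hr.groupby('date')['hr'].mean().rename('mean_hr').reset_index(),
    on='date', how='left')
print(daily_with_hr)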
