Missing rows when merging pandas dataframes? - python

I am trying to merge the following two dataframes on the date, price, and code columns, but rows are unexpectedly lost:
df1.head(5)
date price code
0 2015-01-03 27.65 534
1 2015-03-30 43.65 231
2 2015-01-09 24.65 132
3 2015-11-13 87.13 211
4 2015-05-12 63.25 231
df2.head(5)
date price code volume
0 2015-01-03 27.65 534 43
1 2015-03-30 43.65 231 21
2 2015-01-09 24.65 132 88
3 2015-12-25 0.00 211 130
4 2015-03-12 11.15 211 99
df1 is a subset of df2, but without the volume column.
I tried the following code, but half of the rows disappear:
master = pd.merge(df1, df2, on=['date', 'price', 'code'])
I know I can do a left join and keep all the rows, but they will be NaN for volume, which is the entire purpose of doing the join.
I'm not sure why rows go missing after the join, as df2 contains every single date, with the same prices and codes as df1. The two dataframes also have the same data types, with dates stored as strings.
df1.dtypes
date object
price float64
code int
dtype: object
df2.dtypes
date object
price float64
code int
volume int
dtype: object
When I check whether a certain value exists in a dataframe, it returns False, which could be a clue as to why the rows don't join as expected (unless I'm checking it incorrectly):
df2['price'][0]
> 27.65
27.65 in df2['price']
> False
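(Note that in on a pandas Series tests the index labels, not the values, so this check on its own is misleading; a direct value check would be something like the following.)
# `in` on a Series checks the index, so test the values explicitly instead
(df2['price'] == 27.65).any()      # True if any price equals 27.65
df2['price'].isin([27.65]).any()   # equivalent membership check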
I'm at a loss about what to check next; any ideas? I think there could be some sort of encoding issue or something along those lines.
I also manually checked the dataframes for the rows which weren't able to be joined and the values definitely exist.
Edit: I know it's expected for the rows to be dropped when they do not match on the columns specified in the join. The problem is that rows are dropped even when they do match (or they appear to be).

You say "df1 is a subset of df2, but without the volume column". I suppose you don't mean that their only difference is the "volume" column, which is obviously not the only difference, otherwise the question would be trivial (of just adding the "volume" column).
You just need to add the how='outer' keyword argument.
df = pd.merge(df1, df2, on=['date', 'code', 'price'], how='outer')

With your dataframe, since your column naming is consistent, you can do the following:
master = pd.merge(df1,df2, how='left')
By default, pd.merge() joins on all common columns and performs an inner join. Use how='left' to specify a left join.

Those rows disappear because rows 3 and 4 have different price values in the two dataframes (and, for row 4, the code differs as well). merge's default behavior is an inner join, so rows that do not match on the key columns are dropped.
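If you want to see exactly which rows fail to match, one quick way to diagnose it (using the same key columns as in the question and pandas' merge indicator) is:
# indicator=True adds a _merge column marking each row as
# 'left_only', 'right_only' or 'both'
check = pd.merge(df1, df2, on=['date', 'price', 'code'], how='outer', indicator=True)
print(check[check['_merge'] != 'both'])   # rows present in only one dataframe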

Related

Pandas - pivoting multiple columns into fewer columns with some level of detail kept

Say I have the following code that generates a dataframe:
df = pd.DataFrame({"customer_code": ['1234','3411','9303'],
"main_purchases": [3,10,5],
"main_revenue": [103.5,401.5,99.0],
"secondary_purchases": [1,2,4],
"secondary_revenue": [43.1,77.5,104.6]
})
df.head()
There's the customer_code column that's the unique ID for each client.
And then there are 2 columns to indicate the purchases that took place and revenue generated from main branches by those clients.
And another 2 columns to indicate the purchases/revenue from secondary branches by those clients.
I want to pivot the data into a format where a new column differentiates between main and secondary, while the purchases and revenue numbers stay in their own columns rather than being mixed together (one row per customer_code and type, with purchases and revenue columns).
The obvious solution is just to split this into two dataframes and then concatenate them (roughly as sketched below), but I'm wondering whether there's a built-in way to do this in a line or two; this strikes me as the kind of thing someone might have thought to bake in a solution for.
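For reference, the split-and-concatenate version I have in mind would be something along these lines (the renames are just to line the columns up):
# Split into a "main" block and a "secondary" block, normalise the column
# names, tag each block with its type, then stack them back together
main = df[['customer_code', 'main_purchases', 'main_revenue']].rename(
    columns={'main_purchases': 'purchases', 'main_revenue': 'revenue'})
main['type'] = 'main'

secondary = df[['customer_code', 'secondary_purchases', 'secondary_revenue']].rename(
    columns={'secondary_purchases': 'purchases', 'secondary_revenue': 'revenue'})
secondary['type'] = 'secondary'

out = pd.concat([main, secondary], ignore_index=True)[
    ['customer_code', 'type', 'purchases', 'revenue']]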
With a little column renaming (a regular expression and str.replace) to put "purchases" and "revenue" first in the column names, we can use pd.wide_to_long to convert these now-stubnames from columns to rows:
# Reorder column names so stubnames are first
df.columns = [df.columns[0],
              *df.columns[1:].str.replace(r'(.*)_(.*)', r'\2_\1', regex=True)]

# Convert wide to long
df = (
    pd.wide_to_long(
        df,
        i='customer_code',
        stubnames=['purchases', 'revenue'],
        j='type',
        sep='_',
        suffix='.*'
    )
    .sort_index()     # Optional sort to match expected output
    .reset_index()    # retrieve customer_code from the index
)
df:
   customer_code       type  purchases  revenue
0           1234       main          3    103.5
1           1234  secondary          1     43.1
2           3411       main         10    401.5
3           3411  secondary          2     77.5
4           9303       main          5     99.0
5           9303  secondary          4    104.6
What does reordering the column headers do?
df.columns = [df.columns[0],
              *df.columns[1:].str.replace(r'(.*)_(.*)', r'\2_\1', regex=True)]
Produces:
Index(['customer_code', 'purchases_main', 'revenue_main',
       'purchases_secondary', 'revenue_secondary'],
      dtype='object')
The "type" column is now the suffix of the column header which allows wide_to_long to process the table as expected.
You can abstract the reshaping process with pivot_longer from pyjanitor; it is essentially a collection of convenience wrappers around Pandas functions:
# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(index='customer_code',
                names_to=('type', '.value'),
                names_sep='_',
                sort_by_appearance=True)
customer_code type purchases revenue
0 1234 main 3 103.5
1 1234 secondary 1 43.1
2 3411 main 10 401.5
3 3411 secondary 2 77.5
4 9303 main 5 99.0
5 9303 secondary 4 104.6
The .value in names_to tells the function that that part of the column name should remain as a header; the other part goes under the type column. The split here is determined by names_sep (there is also a names_pattern option that allows a regular-expression split; see the sketch below). If you do not care about the order of appearance, you can set sort_by_appearance to False.
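For example, a rough equivalent using names_pattern instead of names_sep (assuming the same column layout as above) would be:
# each capture group in names_pattern maps to the corresponding entry in
# names_to: the first group becomes the 'type' value, the second stays a header
df.pivot_longer(index='customer_code',
                names_to=('type', '.value'),
                names_pattern=r'(.+)_(.+)',
                sort_by_appearance=True)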
You can also use the melt() and concat() functions to solve this problem.
import pandas as pd

df1 = df.melt(id_vars='customer_code',
              value_vars=['main_purchases', 'secondary_purchases'],
              var_name='type',
              value_name='purchases',
              ignore_index=True)

df2 = df.melt(id_vars='customer_code',
              value_vars=['main_revenue', 'secondary_revenue'],
              var_name='type',
              value_name='revenue',
              ignore_index=True)
Then we use concat() with axis=1 to join the two frames side by side, and sort_values(by='customer_code') to sort the data by customer.
result = pd.concat([df1, df2['revenue']],
                   axis=1,
                   ignore_index=False).sort_values(by='customer_code')
Then use replace() with a regex to strip the metric suffix so the type names read just main/secondary:
result['type'] = result['type'].replace(r'_.*$', '', regex=True)
The above code will output the below dataframe:
   customer_code       type  purchases  revenue
0           1234       main          3    103.5
3           1234  secondary          1     43.1
1           3411       main         10    401.5
4           3411  secondary          2     77.5
2           9303       main          5     99.0
5           9303  secondary          4    104.6
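One caveat with this approach: the side-by-side concat relies on both melts producing their rows in the same order. If you would rather align the two halves explicitly, an alternative sketch is to strip the suffixes first and merge on the keys:
# strip the metric suffix so 'type' holds just main/secondary, then merge on keys
p = df.melt('customer_code', value_vars=['main_purchases', 'secondary_purchases'],
            var_name='type', value_name='purchases')
r = df.melt('customer_code', value_vars=['main_revenue', 'secondary_revenue'],
            var_name='type', value_name='revenue')
p['type'] = p['type'].str.replace('_purchases', '', regex=False)
r['type'] = r['type'].str.replace('_revenue', '', regex=False)
result = p.merge(r, on=['customer_code', 'type'])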

How to fill missing dates with corresponding NaN in other columns

I have a CSV that initially creates following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates and have a corresponding NaN in the Portfoliovalue column, so the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However, the bfill replaces all my NaNs, and removing it only returns an error.
So far I have tried this:
df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code so that it fills my missing dates. However, it is part of a program that tries to update the missing dates every time it launches, so when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it, the code adds a duplicate date, which then causes the reindexing inside asfreq to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex(), or does the assignment of today's date need changing?
Pandas has an asfreq function for a DatetimeIndex; it is basically just a thin but convenient wrapper around reindex() that generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
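As for the follow-up error ("cannot reindex from a duplicate axis"): it appears because the script appends today's date even when it already exists. One way around it, keeping the column names from your question (and date/portfolio_value computed as in your snippet), is to collapse duplicate dates before calling asfreq, keeping the latest value for each date:
df2 = pd.read_csv("Portfoliovalues.csv")
df2.loc[len(df2)] = [date, portfolio_value]              # append today's snapshot
df2['Date'] = pd.to_datetime(df2['Date'])
df2 = (df2.drop_duplicates(subset='Date', keep='last')   # keep only the newest row per date
          .set_index('Date')
          .sort_index()
          .asfreq('D')
          .reset_index())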
Pandas has a reindex method: given a list of indices, it keeps only those indices, inserting rows of NaN for any that were not present before.
In your case, you can create all the dates you want, with date_range for example, and then pass them to reindex. You might need a simple set_index and reset_index, but I assume you don't care much about the original index.
Example:
df['Date'] = pd.to_datetime(df['Date'])   # ensure Date is datetime so the new index matches
(df.set_index('Date')
   .reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D'))
   .reset_index())
First we set the 'Date' column as the index. Then we use reindex with the full list of dates (produced by date_range from the minimal to the maximal date in the 'Date' column, with daily frequency) as the new index. This results in NaNs wherever there was no former value.

Iterating through DataFrame columns and deleting a row if a cell value is not a number

I have the following csv file that I converted to a DataFrame:
apartment,floor,gasbill,internetbill,powerbill
401,4,120,nan,340
409,4,190,50,140
410,4,155,45,180
I want to be able to iterate over each column and, if the value of a cell in the internetbill column is not a number, delete that whole row. So in this example, the "401,4,120,nan,340" row would be eliminated from the DataFrame.
I thought that something like this would work, but to no avail, and I'm stuck:
df.drop[df['internetbill'] == "nan"]
If you are using pd.read_csv, then that nan will get imported as np.nan. If so, you need dropna:
df.dropna(subset=['internetbill'])
apartment floor gasbill internetbill powerbill
1 409 4 190 50.0 140
2 410 4 155 45.0 180
If those are strings for whatever reason, you could do one of two things:
replace
df.replace({'internetbill': {'nan': np.nan}}).dropna(subset=['internetbill'])
to_numeric
df.assign(
    internetbill=pd.to_numeric(df['internetbill'], errors='coerce')
).dropna(subset=['internetbill'])

Pandas Dataframe MultiIndex transform one level of the multiindex to another axis while keeping the other level in the original axis

I have a Pandas Dataframe with MultiIndex in the row indexers like this:
This dataframe is the result of a groupby operation and then slicing from a 3-level MultiIndex. I would like the 'date' row indexer to remain, but to shift the 'SlabType' level of the row index into the column index, with non-available values as NaN.
This is what I would like to get to:
What operations do I need to do to achieve this? Also if the title of the question can be improved, please suggest so.
Use unstack after selecting the SlabLT column:
print (df['SlabLT'].unstack())
But if duplicates in the MultiIndex are possible, it is necessary to aggregate the column first, e.g. by mean:
print (df.groupby(level=[0,1])['SlabLT'].mean().unstack())
Sample:
df = pd.DataFrame({'date': ['2017-10-01', '2017-10-08', '2017-10-08', '2017-10-15', '2017-10-15'],
                   'SlabType': ['UOM2', 'AMOUNT', 'UOM2', 'AMOUNT', 'AMOUNT'],
                   'SlabLT': [1, 6000, 1, 6000, 5000]}).set_index(['date', 'SlabType'])
print (df)
                     SlabLT
date       SlabType
2017-10-01 UOM2           1
2017-10-08 AMOUNT      6000
           UOM2           1
2017-10-15 AMOUNT      6000   <- duplicated MultiIndex ('2017-10-15', 'AMOUNT')
           AMOUNT      5000   <- duplicated MultiIndex ('2017-10-15', 'AMOUNT')
print (df['SlabLT'].unstack())
ValueError: Index contains duplicate entries, cannot reshape
print (df.groupby(level=[0,1])['SlabLT'].mean())
date SlabType
2017-10-01 UOM2 1
2017-10-08 AMOUNT 6000
UOM2 1
2017-10-15 AMOUNT 5500
Name: SlabLT, dtype: int64
print (df.groupby(level=[0,1])['SlabLT'].mean().unstack())
SlabType AMOUNT UOM2
date
2017-10-01 NaN 1.0
2017-10-08 6000.0 1.0
2017-10-15 5500.0 NaN
Since you have NaN values for some entries, you may want to consider pivot_table to avoid the "duplicate entries" ValueError when unstacking one of the index levels.
Suppose you have a DataFrame df with the column 'SlabLT' and row indices date and SlabType; try:
df.reset_index().pivot_table(values='SlabLT', index='date', columns='SlabType')
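For reference, run against the sample df built in the previous answer, pivot_table's default mean aggregation handles the duplicated ('2017-10-15', 'AMOUNT') rows, so the result should look along these lines:
print (df.reset_index().pivot_table(values='SlabLT', index='date', columns='SlabType'))
SlabType    AMOUNT  UOM2
date
2017-10-01     NaN   1.0
2017-10-08  6000.0   1.0
2017-10-15  5500.0   NaN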

Restrict index in Pandas Excelfile

I'm not sure I'm going to describe this right, but I'll try.
I have several excel files with about 20 columns and 10k or so rows. Let's say the column names are in the form col1, col2...col20.
Col2 is a timestamp column, so, for instance, a value could read: "2012-07-25 14:21:00".
I want to read the excel files into a DataFrame and perform some time series and grouping operations.
Here's some simplified code to load an excel file:
xl = pd.ExcelFile(os.path.join(dirname, filename))
df = xl.parse(xl.sheet_names[0], index_col=1) # Col2 above
When I run
df.index
it gives me:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-19 15:37:55, ..., 2012-02-02 16:13:42]
Length: 9977, Freq: None, Timezone: None
as expected. However, inspecting the columns, I get:
Index([u'Col1', u'Col2',...u'Col20'], dtype='object')
This may be why I have problems with some of the manipulations I want to do. For instance, when I run:
df.groupby(category_col).count()
I expect to get a dataframe with 1 row for each category and 1 column containing the count for that category. Instead, I get a dataframe with 1 row for each category and 19 columns describing the number of values for that column/category pair.
The same thing happens when I try to resample:
df.resample('D', how='count')
Instead of a single column Dataframe with the number of records per day, I get:
2012-01-01  Col1     8
            Col2     8
            Coln     8
2012-01-02  Col1    10
            Col2    10
            Coln    10
Is this normal behavior? How would I instead get just one value per day, category, whichever?
Based on this blog post from Wes McKinney, I think the problem is that I have to run my operations on a specific column, namely a column that I know won't have missing data.
So instead of doing:
df.groupby(category_col).count()
I should do:
df['col3'].groupby(df[category_col]).count()
and this:
df2.resample('D', how='count')
should be this:
df2['col3'].resample('D', how='count')
The results are more in line with what I'm looking for:
Category
Cat1 1232
Cat2 7677
Cat3 1053
Date
2012-01-01 8
2012-01-02 66
2012-01-03 89
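(If you are on a recent pandas version, note that the how= keyword has since been removed from resample; the equivalent of the call above would be the method form:)
# modern equivalent of df2['col3'].resample('D', how='count')
df2['col3'].resample('D').count()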
