I have 2 DataFrames which need to be compared iteratively, and the mismatched rows have to be stored in a CSV. Since the data has historical dates, the comparison needs to be performed by year. How can this be achieved in pandas?
product_1 price_1 Date of purchase
0 computer 1200 2022-01-02
1 monitor 800 2022-01-03
2 printer 200 2022-01-04
3 desk 350 2022-01-05
product_2 price_2 Date of purchase
0 computer 900 2022-01-02
1 monitor 800 2022-01-03
2 printer 300 2022-01-04
3 desk 350 2022-01-05
I would use a split/merge/where
df1['Date of purchase'] = df1['Date of purchase'].apply(lambda x : x.split('-')[0])
df2['Date of purchase'] = df2['Date of purchase'].apply(lambda x : x.split('-')[0])
From there you can combine the two dataframes using a join or merge.
After that you can use np.where() to flag which rows match:
merge_df['Check'] = np.where(merge_df['comp_column'] != merge_df['another_comp_column'], False, True)
From there you can just look for where the comp columns didn't match
merge_df.loc[merge_df['Check'] == False]
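Putting those pieces together, here is a rough end-to-end sketch using the column names from the sample data (the merge keys and the output filename are my assumptions, not part of the original fragments):
import numpy as np
import pandas as pd

# Keep only the year part of the purchase date (the columns are strings here)
df1['Date of purchase'] = df1['Date of purchase'].apply(lambda x: x.split('-')[0])
df2['Date of purchase'] = df2['Date of purchase'].apply(lambda x: x.split('-')[0])
# Merge on year and product name (product_1/product_2 are assumed to hold the same items)
merge_df = df1.merge(df2, left_on=['Date of purchase', 'product_1'],
                     right_on=['Date of purchase', 'product_2'])
# Check is True where the prices match, False where they differ
merge_df['Check'] = np.where(merge_df['price_1'] != merge_df['price_2'], False, True)
# Keep the mismatches and store them in a CSV (filename is illustrative)
merge_df.loc[merge_df['Check'] == False].to_csv('mismatches.csv', index=False)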
First, let's solve the problem for a single group of dates/years. You could merge your data using the date and product names:
df = df1.merge(df2, left_on=["Date of purchase", "product_1"], right_on=["Date of purchase", "product_2"])
# Bonus points if you rename "product_2" and only use `on` instead of `left_on` and `right_on`
After that, you could simply use .loc to find the rows where prices do not match:
df.loc[df["price_1"] != df["price_2"]])
product_1 price_1 Date of purchase product_2 price_2
0 computer 1200 2022-01-02 computer 900
2 printer 200 2022-01-04 printer 300
Now, you could process each year by iterating over a list of years, querying only the data from that year in df1 and df2, and then using the above procedure to find the price mismatches:
# List available years ('Date of purchase' must already be a datetime column for .dt to work)
years = pd.concat([df1["Date of purchase"].dt.year, df2["Date of purchase"].dt.year], axis=0).unique()
# Rename columns for those bonus points
df1 = df1.rename(columns={"product_1": "product"})
df2 = df2.rename(columns={"product_2": "product"})
# Accumulate your rows in a new dataframe (starting from a list)
output_rows = list()
for year in years:
    # Find data for this `year`
    df1_year = df1.loc[df1["Date of purchase"].dt.year == year]
    df2_year = df2.loc[df2["Date of purchase"].dt.year == year]
    # Apply the procedure described at the beginning
    df = df1_year.merge(df2_year, on=["Date of purchase", "product"])
    # Find rows where prices do not match
    mismatch_rows = df.loc[df["price_1"] != df["price_2"]]
    output_rows.append(mismatch_rows)
# Now, transform your rows into a single dataframe
output_df = pd.concat(output_rows)
Output:
product price_1 Date of purchase price_2
0 computer 1200 2022-01-02 900
2 printer 200 2022-01-04 300
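Since the question also asks for the mismatched rows to be stored in a CSV, a final to_csv call completes the workflow (the filename is illustrative):
output_df.to_csv("mismatch_rows.csv", index=False)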
I am trying to expand a dataframe containing a number of columns by creating rows based on the interval between two date columns.
For this I am currently using a method that basically creates a cartesian product, which works well on small datasets but is very inefficient on large ones.
This method will be used on a DataFrame of roughly 2 million rows by 50 columns spanning multiple years from min to max date. The resulting dataset will be about 3 million rows, so a more efficient approach is required.
I have not succeeded in finding an alternative method which is less resource intensive.
What would be the best approach for this?
My current method here:
from datetime import date
import pandas as pd
raw_data = {'id': ['aa0', 'aa1', 'aa2', 'aa3'],
'number': [1, 2, 2, 1],
'color': ['blue', 'red', 'yellow', "green"],
'date_start': [date(2022,1,1), date(2022,1,1), date(2022,1,7), date(2022,1,12)],
'date_end': [date(2022,1,2), date(2022,1,4), date(2022,1,9), date(2022,1,14)]}
df = pd.DataFrame(raw_data)
This gives the following dataframe:
    id  number   color  date_start    date_end
0  aa0       1    blue  2022-01-01  2022-01-02
1  aa1       2     red  2022-01-01  2022-01-04
2  aa2       2  yellow  2022-01-07  2022-01-09
3  aa3       1   green  2022-01-12  2022-01-14
Now to create a set containing all possible dates between the min and max date of the set:
df_d = pd.DataFrame({'date': pd.date_range(df['date_start'].min(), df['date_end'].max() + pd.Timedelta('1d'), freq='1d')})
This results in a frame containing all the possible dates, as expected.
Finally, cross merge the original set with the date set and filter the resulting rows based on the start and end date of each row:
df_total = pd.merge(df, df_d, how='cross')
df = df_total[(df_total['date_start'] < df_total['date']) & (df_total['date_end'] >= df_total['date'])]
This leads to the following final dataframe, which is exactly what is needed.
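For scale (rough arithmetic on my part, assuming roughly a three-year span, i.e. on the order of 1,000 distinct dates): the cross merge materializes about 2,000,000 × 1,000 ≈ 2 billion intermediate rows before the filter runs, which is why this approach does not scale.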
Efficient Solution
# Make sure the date columns are datetime64 so .dt and timedelta arithmetic work
df[['date_start', 'date_end']] = df[['date_start', 'date_end']].apply(pd.to_datetime)
d = df['date_end'].sub(df['date_start']).dt.days                # days elapsed per row
df1 = df.reindex(df.index.repeat(d))                            # repeat each row d times
i = df1.groupby(level=0).cumcount() + 1                         # counter 1..d per original row
df1['date'] = df1['date_start'] + pd.to_timedelta(i, unit='d')  # offset date_start by the counter
How does it work?
Subtract start from end to calculate the number of days elapsed, then reindex the dataframe by repeating the index exactly that many times per row. Now group df1 by the index and use cumcount to create a sequential counter, then build a timedelta series from this counter and add it to date_start to get the result.
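For the sample data the intermediate values look like this (a quick check, assuming the code above has just run):
print(d.tolist())                                       # [1, 3, 2, 2]
print((df1.groupby(level=0).cumcount() + 1).tolist())   # [1, 1, 2, 3, 1, 2, 1, 2]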
Result
id number color date_start date_end date
0 aa0 1 blue 2022-01-01 2022-01-02 2022-01-02
1 aa1 2 red 2022-01-01 2022-01-04 2022-01-02
1 aa1 2 red 2022-01-01 2022-01-04 2022-01-03
1 aa1 2 red 2022-01-01 2022-01-04 2022-01-04
2 aa2 2 yellow 2022-01-07 2022-01-09 2022-01-08
2 aa2 2 yellow 2022-01-07 2022-01-09 2022-01-09
3 aa3 1 green 2022-01-12 2022-01-14 2022-01-13
3 aa3 1 green 2022-01-12 2022-01-14 2022-01-14
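If you prefer a clean 0..n index instead of the repeated original index shown above, an optional reset can be appended (my addition, not part of the original answer):
df1 = df1.reset_index(drop=True)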
I don't know if this is an improvement, but here the pd.date_range only gets created between the start and end date of each row. The created list gets exploded and joined to the original df.
from datetime import date
import pandas as pd
raw_data = {'id': ['aa0', 'aa1', 'aa2', 'aa3'],
'number': [1, 2, 2, 1],
'color': ['blue', 'red', 'yellow', "green"],
'date_start': [date(2022,1,1), date(2022,1,1), date(2022,1,7), date(2022,1,12)],
'date_end': [date(2022,1,2), date(2022,1,4), date(2022,1,9), date(2022,1,14)]}
df = pd.DataFrame(raw_data)
s = df.apply(lambda x: pd.date_range(x['date_start'], x['date_end'], freq='1d', inclusive='right').date, axis=1).explode()
df.join(s.rename('date'))
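As a quick sanity check (my addition, not the answerer's): inclusive='right' drops the start date itself, which matches the date_start < date filter used in the question's cross-merge approach.
print(pd.date_range(date(2022, 1, 1), date(2022, 1, 4), freq='1d', inclusive='right').date)
# [datetime.date(2022, 1, 2) datetime.date(2022, 1, 3) datetime.date(2022, 1, 4)]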
I've got a dataframe with data taken from a database like this:
conn = sqlite3.connect('REDB.db')
dataAvg1 = pd.read_sql_query(
"SELECT UNIQUE_RE_NUMBER, TYP_ID, LOCATION, RE_PRICE, PRICE.RE_ID, PRICE.UPDATE_DATE, HOUSEINFO.RE_POLOHA, HOUSEINFO.RE_DRUH, HOUSEINFO.RE_TYP, HOUSEINFO.RE_UPLOCHA FROM PRICE INNER JOIN REAL_ESTATE, ADDRESS, HOUSEINFO ON REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=HOUSEINFO.INF_ID",conn
)
dataAvg2 = pd.read_sql_query(
"SELECT UNIQUE_RE_NUMBER, TYP_ID, LOCATION, RE_PRICE, PRICE.RE_ID, PRICE.UPDATE_DATE, FLATINFO.RE_DISPOZICE, FLATINFO.RE_DRUH, FLATINFO.RE_PPLOCHA FROM PRICE INNER JOIN REAL_ESTATE, ADDRESS, FLATINFO ON REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=FLATINFO.INF_ID",conn
)
dataAvg3 = pd.read_sql_query(
"SELECT UNIQUE_RE_NUMBER, TYP_ID, LOCATION, RE_PRICE, PRICE.RE_ID, PRICE.UPDATE_DATE, LANDINFO.RE_PLOCHA, LANDINFO.RE_DRUH, LANDINFO.RE_SITE, LANDINFO.RE_KOMUNIKACE FROM PRICE INNER JOIN REAL_ESTATE, ADDRESS, LANDINFO ON REAL_ESTATE.ID=PRICE.RE_ID AND REAL_ESTATE.ID=ADDRESS.RE_ID AND REAL_ESTATE.ID=LANDINFO.INF_ID",conn
)
conn.close()
df2 = [dataAvg1, dataAvg2, dataAvg3]
dfAvg = pd.concat(df2)
dfAvg = dfAvg.reset_index(drop=True)
The main columns are UNIQUE_RE_NUMBER, RE_PRICE and UPDATE_DATE. I would like to count the frequency of price changes each day. Ideally I would create a new column called 'FREQUENCY' and add a number for each day. For example:
UPDATE_DAY UNIQUE_RE_NUMBER RE_PRICE FREQUENCY
1.1.2021 1 500 2
1.1.2021 2 400 2
2.1.2021 1 500 1
2.1.2021 2 450 1
I hope this example is understandable.
Right now I have something like this:
dfAvg['FREQUENCY'] = dfAvg.groupby('UPDATE_DATE')['UPDATE_DATE'].transform('count')
dfAvg.drop_duplicates(subset=['UPDATE_DATE'], inplace=True)
This code counts every price added that day, so when the price of a real estate on 1.1.2021 was 500 and the next day it is also 500, it counts as a "change" in price, but in fact the price stayed the same and I don't want to count that. I would like to count only distinct price values for each real estate. Is that possible?
Not sure if this is the most efficient way, but maybe it helps:
def ident_deltas(sdf):
    # 1 where the price differs from the previous price of the same real estate, else 0
    return sdf.assign(
        DELTA=(sdf.RE_PRICE.shift(1) != sdf.RE_PRICE).astype(int)
    )

def sum_deltas(sdf):
    # Total number of price changes on that day
    return sdf.assign(FREQUENCY=sdf.DELTA.sum())

df = (
    df.groupby("UNIQUE_RE_NUMBER", group_keys=False).apply(ident_deltas)
    .groupby("UPDATE_DAY", group_keys=False).apply(sum_deltas)
    .drop(columns="DELTA")
)
Result for
df =
UPDATE_DAY UNIQUE_RE_NUMBER RE_PRICE
0 2021-01-01 1 500
1 2021-01-01 2 400
2 2021-02-01 1 500
3 2021-02-01 2 450
is
UPDATE_DAY UNIQUE_RE_NUMBER RE_PRICE FREQUENCY
0 2021-01-01 1 500 2
1 2021-01-01 2 400 2
2 2021-02-01 1 500 1
3 2021-02-01 2 450 1
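If you then want a single row per day, as in the drop_duplicates step from the question, something along these lines should work (a sketch; the variable name is illustrative):
daily_freq = df.drop_duplicates(subset=["UPDATE_DAY"])[["UPDATE_DAY", "FREQUENCY"]]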
I want to perform a rolling median on the price column over the previous 4 days, with the data grouped by date. So basically I want to take the prices for a given day plus all prices from the 4 days before it and calculate the median of these values.
Here are the sample data:
id date price
1637027 2020-01-21 7045204.0
280955 2020-01-11 3590000.0
782078 2020-01-28 2600000.0
1921717 2020-02-17 5500000.0
1280579 2020-01-23 869000.0
2113506 2020-01-23 628869.0
580638 2020-01-25 650000.0
1843598 2020-02-29 969000.0
2300960 2020-01-24 5401530.0
1921380 2020-02-19 1220000.0
853202 2020-02-02 2990000.0
1024595 2020-01-27 3300000.0
565202 2020-01-25 3540000.0
703824 2020-01-18 3990000.0
426016 2020-01-26 830000.0
I got close with combining rolling and groupby:
df.groupby('date').rolling(window = 4, on = 'date')['price'].median()
But this seems to produce one row per index value, and given how the median works, I cannot simply merge these rows afterwards to produce one result per day.
Result now looks like this:
date date
2020-01-10 2020-01-10 NaN
2020-01-10 NaN
2020-01-10 NaN
2020-01-10 3070000.0
2020-01-10 4890000.0
...
2020-03-11 2020-03-11 4290000.0
2020-03-11 3745000.0
2020-03-11 3149500.0
2020-03-11 3149500.0
2020-03-11 3149500.0
Name: price, Length: 389716, dtype: float64
It seems it just dropped the first 3 values and then just printed the price values.
Is it possible to get one lagged / moving median value per one date?
You can use rolling with a frequency window of 5 days to get today and the last 4 days, then drop_duplicates to keep the last row per day. First create a copy (if you want to keep the original), sort_values by date, and ensure the date column is datetime:
#sort and change to datetime
df_f = df[['date','price']].copy().sort_values('date')
df_f['date'] = pd.to_datetime(df_f['date'])
#create the column rolling
df_f['price'] = df_f.rolling('5D', on='date')['price'].median()
#drop_duplicates and keep the last row per day
df_f = df_f.drop_duplicates(['date'], keep='last').reset_index(drop=True)
print (df_f)
date price
0 2020-01-11 3590000.0
1 2020-01-18 3990000.0
2 2020-01-21 5517602.0
3 2020-01-23 869000.0
4 2020-01-24 3135265.0
5 2020-01-25 2204500.0
6 2020-01-26 849500.0
7 2020-01-27 869000.0
8 2020-01-28 2950000.0
9 2020-02-02 2990000.0
10 2020-02-17 5500000.0
11 2020-02-19 3360000.0
12 2020-02-29 969000.0
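If you need that per-day median back on the original rows (one value per date, as asked), a map on the date column is one way to do it (a sketch; the new column name is illustrative and df is assumed to still hold the original string dates):
# Attach the per-day rolling median to every original row
df['rolling_median_5d'] = pd.to_datetime(df['date']).map(df_f.set_index('date')['price'])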
This is a step-by-step process. There are probably more efficient methods of getting what you want. Note: if you have time information in your dates, you will need to drop it before grouping by date.
import pandas as pd
import statistics as stat
import numpy as np
# Replace with you data import
df = pd.read_csv('random_dates_prices.csv')
# Convert your date to a datetime
df['date'] = pd.to_datetime(df['date'])
# Sort your data by date
df = df.sort_values(by = ['date'])
# Create group by object
dates = df.groupby('date')
# Reformat dataframe for one row per day, with prices in a nested list
df = pd.DataFrame(dates['price'].apply(lambda s: s.tolist()))
# Extract price lists to a separate list
prices = df['price'].tolist()
# Initialize list to store past four days of prices for current day
four_days = []
# Loop over the prices list to combine the last four days to a single list
for i in range(3, len(prices), 1):
    x = i - 1
    y = i - 2
    z = i - 3
    four_days.append(prices[i] + prices[x] + prices[y] + prices[z])
# Initialize a list to store median values
medians = []
# Loop through the four_days list and calculate the median of the last four days for the current date
for i in range(len(four_days)):
    medians.append(stat.median(four_days[i]))
# Create dummy zero values to add lists create to dataframe
four_days.insert(0, 0)
four_days.insert(0, 0)
four_days.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
# Add both new lists to data frames
df['last_four_day_prices'] = four_days
df['last_four_days_median'] = medians
# Replace dummy zeros with np.nan
df[['last_four_day_prices', 'last_four_days_median']] = df[['last_four_day_prices', 'last_four_days_median']].replace(0, np.nan)
# Clean data frame so you only have a single date a median value for past four days
df_clean = df.drop(['price', 'last_four_day_prices'], axis=1)
I've got order data with SKUs inside and would like to find out how often each SKU has been bought per month over the last 3 years.
for row in df_skus.iterrows():
    df_filtered = df_orders.loc[df_orders['item_sku'] == row[1]['sku']]
    # Keep only the needed columns:
    df_filtered = df_filtered[['txn_id', 'date', 'item_sku']].copy()
    # Group by year and month:
    df_result = df_filtered['date'].groupby([df_filtered.date.dt.year, df_filtered.date.dt.month]).agg('count')
    print(df_result)
    print(type(df_result))
The (shortened) result looks good so far:
date date
2017 3 1
Name: date, dtype: int64
date date
2017 2 1
3 6
4 1
6 1
Name: date, dtype: int64
Now, I'd like to create a CSV which looks like that:
SKU 2017-01 2017-02 2017-03
17 0 0 1
18 0 1 3
Is it possible to simply 'convert' my data into the desired structure?
I do these kinds of calculations all the time and this seems to be the fastest approach.
import pandas as pd
df_orders = df_orders[df_orders["item_sku"].isin(df_skus["sku"])]
monthly_sales = df_orders.groupby(["item_sku", pd.Grouper(key="date",freq="M")]).size()
monthly_sales = monthly_sales.unstack("date")  # one row per SKU, one column per month
monthly_sales.to_csv("my_csv.csv")
The first line filters to the SKUs you want.
The second line does a groupby and counts the number of sales per SKU per month.
The next line unstacks the MultiIndex into the format you want (SKUs as rows, months as columns).
The last line exports the result to CSV.
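If you want the CSV to mirror the desired layout exactly (zeros instead of NaN and YYYY-MM month labels), a small post-processing sketch like this could go before writing the file (my addition, not part of the original answer):
# Replace missing months with 0 and shorten the month-end timestamps to YYYY-MM labels
monthly_sales = monthly_sales.fillna(0).astype(int)
monthly_sales.columns = monthly_sales.columns.strftime("%Y-%m")
monthly_sales.to_csv("my_csv.csv")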
I have a data frame with these columns:
timestamp,stockname,total volume traded
There are multiple stock names at each timestamp:
11:00,A,100
11:00,B,500
11:01,A,150
11:01,B,600
11:02,A,200
11:02,B,650
I want to create a ChangeInVol column such that each stock carries its own difference, like:
timestamp, stock,total volume, change in volume
11:00,A,100,NaN
11:00,B,500,NAN
11:01,A,150,50
11:01,B,600,100
11:02,A,200,50
11:02,B,650,50
If it were a single stock, I could have done
df['ChangeVol'] = df['TotalVol'] - df['TotalVol'].shift(1)
but there are multiple stocks
Need sort_values + DataFrameGroupBy.diff:
#if columns not sorted
df = df.sort_values(['timestamp','stockname'])
df['change in volume'] = df.groupby('stockname')['total volume traded'].diff()
print (df)
timestamp stockname total volume traded change in volume
0 11:00 A 100 NaN
1 11:00 B 500 NaN
2 11:01 A 150 50.0
3 11:01 B 600 100.0
4 11:02 A 200 50.0
5 11:02 B 650 50.0
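The same grouping idea also works with the shift pattern from the question, if you prefer that form (an equivalent sketch using the actual column names):
df['change in volume'] = df['total volume traded'] - df.groupby('stockname')['total volume traded'].shift(1)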