I have two pandas DataFrames imported from Excel (df1 and df2). df1 holds the replacement dates and consists of a Dates column and a Notes column (200 rows). df2 holds the dates on which a check was performed (40 rows).
I would like to filter df1 (or generate a new table, df1'), so that every date in df1 that differs by less than 5 days from any date in df2 is removed from df1.
Since a check was performed on those dates, we can assume the component was not replaced within a margin of 10 days (5 days on either side of the check).
e.g.
df1
22/04/2017
23/04/2017
07/06/2017
20/08/2017
df2
21/04/2017
df1'
07/06/2017
20/08/2017
You can perform datetime subtraction with numpy broadcasting and filter df1 accordingly.
df1
A
0 2017-04-22
1 2017-04-23
2 2017-07-06
3 2017-08-20
df2
A
0 2017-04-21
df1.A = pd.to_datetime(df1.A) # convert to datetime first
df2.A = pd.to_datetime(df2.A)
# keep only the rows of df1 that lie at least 5 days away from every date in df2
df1[(abs(df1.A.values[:, None] - df2.A.values) / pd.Timedelta(days=1) >= 5).all(1)]
A
2 2017-07-06
3 2017-08-20
For your data, the broadcasted subtraction generates 200 × 40 = 8,000 elements, which is easily manageable. Note, however, that for much larger data this results in a memory blowup, the price paid for the vectorised speed.
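If the frames ever grow to the point where the full broadcast is too expensive, one lower-memory alternative is to find the nearest check date for each replacement date with pd.merge_asof and filter on that single distance. This is only a sketch under the column name A used in the example above:
import pandas as pd

# merge_asof needs both key columns sorted
left = df1.sort_values('A')
right = df2.sort_values('A').rename(columns={'A': 'check'})

# attach the nearest check date to every replacement date
nearest = pd.merge_asof(left, right, left_on='A', right_on='check',
                        direction='nearest')

# keep only the rows at least 5 days away from their nearest check
df1_filtered = nearest.loc[(nearest['A'] - nearest['check']).abs()
                           >= pd.Timedelta(days=5), ['A']]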
Related
I have a data frame with measurements for several groups of participants, and I am doing some calculations for each group. I want to add a column to the big data frame (all participants) from secondary data frames (each holding a partial list of participants).
When I merge a couple of times (merging each new data frame into the existing one), it creates duplicate columns instead of a single column.
As the dataframes differ in size, I cannot compare them directly.
I tried
# df1 - main, bigger dataframe; df2 - smaller dataframe containing a group of df1
for i in range(len(df1)):
    # checking indices to place the data on the correct participant:
    if df1.index[i] not in df2['index']:
        pass
    else:
        df1['rate'][i] = list(df2['rate'][df2['index'] == i])
It does not work properly though. Can you please help with the correct way of assembling the column?
Update: where the index of the main dataframe and the "index" column of the calculated dataframe are the same, copy the rate value from the calculation into the main df.
main dataframe df1
index  rate
1      0
2      0
3      0
4      0
5      0
6      0
dataframe with calculated values
index  rate
1      value
4      value
6      value
output df
index  rate
1      value
2      0
3      0
4      value
5      0
6      value
Try this – using .join() to merge dataframes on their indices and combining two columns using .combine_first():
df = df1.join(df2, lsuffix="_df1", rsuffix="_df2")
df["rate"] = df["rate_df2"].combine_first(df["rate_df1"])
EDIT:
This assumes both dataframes use a matching index. If that is not the case for df2, run this first:
df2 = df2.set_index('index')
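Putting it together with the example data from the question (the rate values below are placeholders, since the actual calculated values were not given):
import pandas as pd

df1 = pd.DataFrame({'rate': [0, 0, 0, 0, 0, 0]}, index=[1, 2, 3, 4, 5, 6])
df2 = pd.DataFrame({'index': [1, 4, 6], 'rate': [0.1, 0.4, 0.6]})

df2 = df2.set_index('index')                        # align df2 on the participant index
df = df1.join(df2, lsuffix="_df1", rsuffix="_df2")  # gives rate_df1 and rate_df2
df["rate"] = df["rate_df2"].combine_first(df["rate_df1"])
print(df[["rate"]])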
I have df1 and df2, where df1 is a balanced panel of 20 stocks with daily datetime data. Because of missing days (weekends, holidays), I am assigning each available day an integer for how many days I have (1-252). df2 is a two-column table that maps each date to its integer.
df2
date integer
2020-06-26, 1
2020-06-29, 2
2020-06-30, 3
2020-07-01, 4
2020-07-02, 5
...
2021-06-25, 252
I would like to map these integers to every asset in df1 by date, therefore returning a single column of integers (1-252) repeated for each asset.
So far I have tried this:
df3 = (df1.merge(df2, left_on='date', right_on='integer'))
which returns an empty dataframe - I don't think I'm fully understanding the logic here.
Assuming both df1 and df2 have the date column under the same label, date, you can simply do:
df3 = df1.merge(df2)
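The original attempt came back empty because it tried to match the date column of df1 against the integer column of df2, so no keys could ever be equal. Merging on the shared date column is enough; here is a small sketch with made-up data (the column names asset and ret are assumptions):
import pandas as pd

df1 = pd.DataFrame({'date': pd.to_datetime(['2020-06-26', '2020-06-26',
                                            '2020-06-29', '2020-06-29']),
                    'asset': ['AAPL', 'MSFT', 'AAPL', 'MSFT'],
                    'ret': [0.01, -0.02, 0.03, 0.00]})
df2 = pd.DataFrame({'date': pd.to_datetime(['2020-06-26', '2020-06-29']),
                    'integer': [1, 2]})

# each date's integer is repeated for every asset traded on that date
df3 = df1.merge(df2, on='date', how='left')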
I am trying to merge two weekly DataFrames, which are made up of one coefficient column each, but with different lengths.
Could I please know how to merge them, maintaining the 'Week' indexing?
[df1]
Week Coeff1
1 -0.456662
1 -0.533774
1 -0.432871
1 -0.144993
1 -0.553376
... ...
53 -0.501221
53 -0.025225
53 1.529864
53 0.044380
53 -0.501221
[16713 rows x 1 columns]
[df2]
Week Coeff
1 0.571707
1 0.086152
1 0.824832
1 -0.037042
1 1.167451
... ...
53 -0.379374
53 1.076622
53 -0.547435
53 -0.638206
53 0.067848
[63265 rows x 1 columns]
I've tried this code:
df3 = pd.merge(df1, df2, how='inner', on='Week')
df3 = df3.drop_duplicates()
df3
But it gave me a new df (df3) with 13386431 rows × 2 columns
Desired outcome: a new df with 3 columns (week, coeff1, coeff2). As df2 is longer, I expect some NaNs in coeff1 to fill the gaps.
I assume your output should look somewhat like this:
Week  Coeff1     Coeff2
1     -0.456662  0.571707
1     -0.533774  0.086152
1     -0.432871  0.824832
2     3          3
2     NaN        3
Don't mind the actual numbers though.
The problem is that you won't achieve that with a join on Week, neither left nor inner, because the Week key is not unique.
On a left join, pandas joins every Coeff2 value where df2.Week == 1 onto every single row of df1 where df1.Week == 1, and that is why you get these millions of rows.
I will try and give you a workaround later, but maybe this helps you to think about this problem from another perspective!
Now is later:
What you actually want to do is concatenate the DataFrames per week.
You achieve that by iterating over every week, building a per-week subset that concatenates df1[week] and df2[week] along axis=1, and then concatenating all these subsets along axis=0 afterwards:
weekly_dfs = []
for week in df1.Week.unique():
    sub_df1 = df1.loc[df1.Week == week, "Coeff1"].reset_index(drop=True)
    # df2's column is named "Coeff"; rename it so the result has Coeff1/Coeff2
    sub_df2 = df2.loc[df2.Week == week, "Coeff"].rename("Coeff2").reset_index(drop=True)
    concat_df = pd.concat([sub_df1, sub_df2], axis=1)
    concat_df["Week"] = week
    weekly_dfs.append(concat_df)

df3 = pd.concat(weekly_dfs).reset_index(drop=True)
The last reset of the index is optional but I recommend it anyways!
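A loop-free alternative (just a sketch, not part of the answer above) is to number the rows within each week on both frames with groupby().cumcount() and do an outer merge on Week plus that counter; this lines up the i-th row of every week and leaves NaN where one frame has fewer rows:
# assumes the column names Week, Coeff1 (df1) and Coeff (df2) from the question
df1_n = df1.assign(obs=df1.groupby('Week').cumcount())
df2_n = df2.assign(obs=df2.groupby('Week').cumcount())

df3 = (df1_n.merge(df2_n, on=['Week', 'obs'], how='outer')
            .sort_values(['Week', 'obs'])
            .drop(columns='obs')
            .rename(columns={'Coeff': 'Coeff2'})
            .reset_index(drop=True))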
Based on your last comment on the question, you may want to concatenate instead of merging the two data frames:
df3 = pd.concat([df1,df2], ignore_index=True, axis=1)
The resulting DataFrame should have 63265 rows and will need some work to get it to the required format (remove the added index columns, rename the remaining columns, etc.), but pd.concat should be a good start.
According to pandas' merge documentation, you can use merge like this:
What you are looking for is a left join. However, the default option is an inner join. You can change this by passing a different how argument:
df2.merge(df1,how='left', left_on='Week', right_on='Week')
Note that this keeps the rows of the bigger df and assigns NaN to them where there is no match in the shorter df.
I am a doctor looking at surgical activity in a DataFrame that has 450,000 records and 129 fields or columns. Each row describes a patient admission to hospital, and the columns describe the operation codes and diagnosis codes for the patient.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452883 entries, 0 to 452882
Columns: 129 entries, fyear to opertn_24Type
dtypes: float64(5), int64(14), object(110)
memory usage: 445.7+ MB
There are 24 operation columns for each row. I want to search in the operation columns (1-24) for the codes for pituitary surgery "B041" and "B012", to identify all patients who have had surgery for a pituitary tumor.
I am a total python beginner and have tried using iloc to describe the range of columns (1-24) which appear starting at position 72 in the list of columns but couldn't get it to work.
I am quite happy searching for individual values eg "B041" in a single column using
df["OPERTN_01"] == "B041"
but would ideally like to search multiple columns (all the surgical columns 1-24) more efficiently.
I tried searching the whole dataframe using
test = df[df.isin(["B041", "B012"])]
but that just returns the entire dataframe with null values.
So I have a few questions.
How do I identify integer positions (iloc numbers) for columns in a large dataframe of 129 columns? I just listed them and counted them to get the first surgical column ("OPERTN_01") at position 72 — there must be an easier way.
What's the best way to slice a dataframe to select records with multiple values from multiple columns?
Let's use .iloc and create a boolean for you to filter by:
import pandas as pd
import numpy as np
np.random.seed(12)
df = pd.DataFrame({"A" : ["John","Deep","Julia","Kate","Sandy"],
"result_1" : np.random.randint(5,15,5),
"result_2" : np.random.randint(5,15,5) })
print(df)
A result_1 result_2
0 John 11 5
1 Deep 6 11
2 Julia 7 6
3 Kate 8 9
4 Sandy 8 10
Next we need to find your intended values in the selected columns:
df.iloc[:,1:27].isin([11,10])
this returns:
result_1 result_2
0 True False
1 False True
2 False False
3 False False
4 False True
From the above, we need to slice our original dataframe by the rows where any value is true (if I've understood you correctly).
For this we can use np.where() with .loc:
df.loc[np.where(df.iloc[:,1:].isin([11,10])==True)[0]]
A result_1 result_2
0 John 11 5
1 Deep 6 11
4 Sandy 8 10
From here it's a simple task to extract your unique IDs.
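Equivalently (a small variation, not part of the answer above), the boolean frame can be reduced row-wise with .any(axis=1) and used directly as a mask, which avoids np.where entirely:
mask = df.iloc[:, 1:].isin([11, 10]).any(axis=1)  # True for rows with at least one match
df[mask]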
Answer 1:
Let's say you are looking for eg_col in your dataframe's columns. Then, you can find its index within the columns using:
df.columns.tolist().index('eg_col')
Answer 2:
In your example, if you know the name of the last surgical column (let's say it's called OPERTN_24), you can slice those columns using:
df_op = df.loc[:, 'OPERTN_01':'OPERTN_24']
Continuing from that, we can look for the values 'B041' and 'B012' in df_op as you tried: df_op.isin(['B041', 'B012']), which returns a boolean value for every entry of the dataframe.
To extract, for example, only the rows where at least one of our target codes comes up, we select them with:
df.index[df_op.isin(['B041', 'B012']).any(axis=1)]
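For the concrete codes in the question, a minimal end-to-end sketch (the data and the patient_id column are made up; only the OPERTN_* names come from the question):
import pandas as pd

df = pd.DataFrame({'patient_id': [1, 2, 3],
                   'OPERTN_01': ['B041', 'X123', 'A001'],
                   'OPERTN_02': [None, 'B012', None],
                   'OPERTN_03': [None, None, 'C999']})

df_op = df.loc[:, 'OPERTN_01':'OPERTN_03']       # label-based slice of the operation columns
mask = df_op.isin(['B041', 'B012']).any(axis=1)  # True where any operation code matches
pituitary = df[mask]                             # rows for patients 1 and 2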
I am running a for loop over each of the 12 months. For each month I get a bunch of dates in random order across various years of history, together with the corresponding temperature data on those dates. E.g. when the loop is in the January iteration, all dates and temperatures I get from history are for January only.
I want to start with an empty pandas dataframe with two columns, 'Dates' and 'Temperature'. As the loop progresses I want to append each month's dates to the 'Dates' column and the corresponding data to the 'Temperature' column.
After my dataframe is ready I want to use the 'Dates' column as the index to order the available 'Temperature' history, so that I end up with correctly sorted historical dates and their temperatures.
I have thought about using numpy and storing the dates and data in two separate arrays, sorting the dates and then sorting the temperatures via some kind of index, but I believe this will be better implemented in pandas, perhaps with its pivot table feature.
@Zanam Please refer to this syntax. I think your question is similar to this answer:
from random import randint
import pandas as pd

df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
    df.loc[i] = [randint(-1, 1) for n in range(3)]

print(df)
lib qty1 qty2
0 0 0 -1
1 -1 -1 1
2 1 -1 1
3 0 0 0
4 1 -1 -1
[5 rows x 3 columns]
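Applied to the Dates/Temperature case, growing a DataFrame row by row like this works but is slow for many rows; collecting one small frame per month and concatenating once at the end, then sorting on the date index, is usually preferable. A minimal sketch (get_history_for_month is a hypothetical stand-in for however the monthly data is obtained):
import pandas as pd

monthly_frames = []
for month in range(1, 13):
    # dates, temps = get_history_for_month(month)            # hypothetical data source
    dates, temps = ['2001-01-15', '1999-01-03'], [5.2, -1.0]  # placeholder values
    monthly_frames.append(pd.DataFrame({'Dates': pd.to_datetime(dates),
                                        'Temperature': temps}))

df = (pd.concat(monthly_frames, ignore_index=True)
        .set_index('Dates')
        .sort_index())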