I'm working with time series data. I need to find the previous week's value for each entry. My data has 3 columns: ID, Date, and Value. I want to create a 4th column, LWValue (Last Week's Value). Here is what the sample data might look like:
ID Date Value
0 1 2/1/2020 100
1 2 2/1/2020 80
2 1 2/2/2020 105
3 2 2/2/2020 84
4 1 2/8/2020 102
5 2 2/8/2020 82
6 1 2/9/2020 104
7 2 2/9/2020 86
How would I go about doing this in Pandas?
I tried this:
# create new column
df["LWValues"] = pd.Series()
# test out code on same values
df.loc[((df.ID == df.ID) & (df.Date == (df.Date) )), "LWValues"].values
# test out code with timedelta grabbing last week
df.loc[((df.ID == df.ID) & (df.Date == (df.Date - datetime.timedelta(days=7)) )), "LWValues"].values
When I do that, the 2nd operation works, but the final one, trying to pull data from the previous week with the timedelta argument does not. Instead, I get an empty array.
How do I need to fix this code?
Alternatively, is there a better way to get the previous week data in Pandas than this?
df.loc doesn't work that way. In your case, it is just comparing each row with itself. One way to do this is by using apply:
df.apply(lambda row : df.loc[(df.ID == row['ID']) & (df.Date == (row['Date'] - datetime.timedelta(days=7)))]['Value'], axis=1)
Don't forget to handle the case where there's no last week entry for a row.
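A faster alternative (not part of the answer above, just a sketch assuming Date has been parsed with pd.to_datetime) is a self-merge: build a lookup keyed on (ID, Date + 7 days) and left-join it back, so rows with no previous-week entry simply get NaN.
import datetime
import pandas as pd

df["Date"] = pd.to_datetime(df["Date"])

# Lookup table: each (ID, Date) shifted forward by 7 days carries that week's Value
lookup = df[["ID", "Date", "Value"]].copy()
lookup["Date"] = lookup["Date"] + datetime.timedelta(days=7)
lookup = lookup.rename(columns={"Value": "LWValue"})

# Left join keeps every original row; rows with no previous-week match get NaN
df = df.merge(lookup, on=["ID", "Date"], how="left")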
I have a dataframe:
ColA ColB
0 A 1/2/2020
1 A 1/3/2020
2 A 1/4/2020
3 A 1/10/2020
4 B 1/3/2020
5 B 1/19/2020
6 C 1/2/2020
7 C 1/7/2020
8 D 1/8/2020
Now I want to find the name of the series in ColA which has three consecutive days in ColB.
Output:
The answer would be A, since it has 1/2/2020, 1/3/2020 and 1/4/2020 in ColB.
A general approach would be like this:
# 0. Parse the dates, in case ColB was read in as strings
df["ColB"] = pd.to_datetime(df["ColB"])
# 1. Make sure the dates are sorted
df = df.sort_values(["ColA", "ColB"])
# 2. Standardize the dates by offsetting them
df["ColB_std"] = df["ColB"] - pd.to_timedelta(range(df.shape[0]), 'day')
# 3. Counting each instance of ColA and standardized date
s = df.groupby(["ColA", "ColB_std"])["ColB_std"].count()
# 4. Getting elements from ColA that have at least 1 sequence of at least length 3
colA = s[ s >= 3 ].index.get_level_values(0).unique().values
# 5. Filtering the dataframe
df[ df["ColA"].isin(colA) ]
You want ColAs with 3 consecutive dates. Or you could think of it as wanting ColAs where there's a sequence of date, date + 1 day and date + 2 days. By sorting the dataframe by ColA and ColB (1), we know that in the case you want to check, date + 1 day will always follow date, and date + 2 days will be the one after that.
With this, you can standardize the dates by removing n days corresponding to their row. So, the sequence of date, date + 1 day and date + 2 days, becomes date, date and date (2).
Now that we have the date column standardized, we just need to count how many rows exist for each pair ('ColA', 'ColB_std') (3), get the elements of ColA that have counts of 3 or more (4), and filter the dataframe (5).
However, this doesn't support duplicated pairs of ('ColA', 'ColB'); for that, you'd need to do this first:
df2 = df.drop_duplicates(["ColA", "ColB"])
Then proceed to use this df2 in steps 1, 2, 3 and 4, and in the end filter the real df in step 5, as in the sketch below.
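For completeness, here is a minimal sketch of that duplicate-safe variant (building on the same parsed ColB as above):
# Steps 1-4 on the de-duplicated frame
df2 = df.drop_duplicates(["ColA", "ColB"]).sort_values(["ColA", "ColB"])
df2["ColB_std"] = df2["ColB"] - pd.to_timedelta(range(df2.shape[0]), 'day')
s = df2.groupby(["ColA", "ColB_std"])["ColB_std"].count()
colA = s[ s >= 3 ].index.get_level_values(0).unique().values
# Step 5 on the real df
df[ df["ColA"].isin(colA) ]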
Previously, I answered that you could also do it like this:
# To make sure the dates are sorted
df = df.sort_values(["ColA", "ColB"])
# Calculating the difference between dates inside each group
s = df.groupby("ColA")["ColB"].diff().dt.days
# Filtering the dataframe
df[ ((s == 1) & (s.shift(1) == 1)).groupby(df["ColA"]).transform("any") ]
The idea is that in s, the difference is always between the previous date and the current date. However, this doesn't make sure that there are 3 consecutive dates, just 2. By shifting the series by 1, you make sure that both the current difference and the previous one are 1 [ (s == 1) & (s.shift(1) == 1) ].
After that, I just groupby(df["ColA"]), and check if any element inside the group is true with transform("any").
I have a csv file that gets mailed to me every day and I want to write a script to clean up the data before I push it into a database. At the bottom of the csv file are 2 empty rows (Row 73 & 74 in the image) and two rows with some junk data in them (Row 75 & 76 in the image), and I need to delete these rows.
To identify the first empty row, it might be helpful to know that Column A will always have data in it until the first empty row (Row 73 in image).
Can you help me figure out how to identify these rows and delete the data in them?
You can check missing values with Series.isna, create a cumulative sum with Series.cumsum, and keep only the rows where it equals 0 with boolean indexing. This solution also works if there is no missing value in the first column.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A':['as','bf', np.nan, 'vd', 'ss'],
                   'B':[1,2,3,4,5]})
print (df)
A B
0 as 1
1 bf 2
2 NaN 3
3 vd 4
4 ss 5
df = df[df['A'].isna().cumsum() == 0]
print (df)
A B
0 as 1
1 bf 2
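Applied to the mailed file, it might look like this (the filename and the column name 'A' are assumptions based on the question):
import pandas as pd

df = pd.read_csv('daily_report.csv')

# Keep only the rows above the first empty cell in column A,
# which drops the two empty rows and the junk rows after them
df = df[df['A'].isna().cumsum() == 0]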
I have a tall pandas dataframe called use with columns ID, Date, .... Each row is unique, but each ID has many rows, with at most one row per ID per date.
ID Date Other_data
1 1-1-01 10
2 1-1-01 23
3 1-1-01 0
1 1-2-01 11
3 1-2-01 1
1 1-3-01 9
2 1-3-01 20
3 1-3-01 2
I also have a list of unique ids, ids = use['ID'].drop_duplicates()
I want to find the intersection of all of the dates, that is, only the dates for which each ID has data. The end result in this toy problem should be [1-1-01, 1-3-01]
Currently, I loop through, subsetting by ID and taking the intersection. Roughly speaking, it looks like this:
dates = use['Date'].drop_duplicates()
for i in ids:
    id_dates = use[(use['ID'] == i)]['Date'].values
    dates = set(dates).intersection(id_dates)
This strikes me as horrifically inefficient. What is a more efficient way to identify dates where each ID has data?
Thanks very much!
Using crosstab, a 0 marks an (ID, Date) combination with no data, so the dates you want are the columns without any 0; you can find them with df.eq(0).any().
df=pd.crosstab(use.ID,use.Date)
df
Out[856]:
Date 1-1-01 1-2-01 1-3-01
ID
1 1 1 1
2 1 0 1
3 1 1 1
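A short sketch completing that idea (using a new name ct so the original use dataframe isn't overwritten): keep only the Date columns that contain no 0, i.e. the dates where every ID has data.
ct = pd.crosstab(use.ID, use.Date)
dates = ct.columns[~ct.eq(0).any()].tolist()
print(dates)
#['1-1-01', '1-3-01']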
Find the unique IDs per date, then check if that's all of them.
gp = df.groupby('Date').ID.nunique()
gp[gp == df.ID.nunique()].index.tolist()
#['1-1-01', '1-3-01']
I have two problems; the first follows now.
I have a dataframe df with the same userid appearing many times, along with a date and some unimportant other columns:
userid date
0 243 2014-04-01
1 234 2014-12-01
2 234 2015-11-01
3 589 2016-07-01
4 589 2016-03-01
I am currently trying to group them by userid, sort the dates descending, and keep only the twelve newest. My code looks like this:
df = df.groupby(['userid'], group_keys=False).agg(lambda x: x.sort_values(['date'], ascending=False, inplace=False).head(12))
And I get this error:
ValueError: cannot copy sequence with size 6 to array axis with dimension 12
At the moment my aim is to avoid to split the dataframe in individual ones.
My second problem is more complex:
I try to find out if the sorted dates (per group of userid) are monthly consecutive. This means if there is a date for one group of userid, for example userid: 234 and date: 2014-04-01, the next entry below must be userid: 234 and date: 2014-03-01. There is no focus on the day; only the year and month are important.
Only these 12 consecutive dates should be copied into another dataframe.
A second dataframe df2 contains the same userid, but they are unique and another column is 'code'. Here is an example:
userid code
0 433805 1
24 5448 0
48 3434 1
72 34434 1
96 3202 1
120 23766 1
153 39457 0
168 4113 1
172 3435 5
374 34093 1
To summarize: I try to check whether there are 12 consecutive months per userid and copy every correct sequence into another dataframe. For this I also have to compare the 'code' from df2.
This is a version of my code:
df['YearMonthDiff'] = df['date'].map(lambda x: 1000*x.year + x.month).diff()
df['id_before'] = df['userid'].shift()
final_df = pd.DataFrame()
for group in df.groupby(['userid'], group_keys=False):
    fi = group[1]
    if (fi['userid'] <> fi['id_before']) & group['YearMonthDiff'].all(-1.0) & df.loc[fi.userid]['code'] != 5:
        final_df.append(group['userid','date', 'consum'])
First I calculated an integer from the date and took diff(). In other posts I saw that people shift the column to compare the value in the current row with the row before. Then I used groupby(userid) to iterate over the single groups. Now it gets extra ugly: I tried to find the beginning of such a userid group, check if there are only consecutive months and the correct 'code', and finally append it to the final dataframe.
One of the biggest problems is comparing a row with the following row. I can iterate over them with iterrows(), but I cannot compare them without shift(). There exists a calendar function, but I will take a look at that on the weekend. Sorry for the mess, I am new to pandas.
Has anyone an idea how to solve my problem?
For your first problem, try this:
df.groupby(by='userid').apply(lambda x: x.sort_values(by='date',ascending=False).iloc[[e for e in range(12) if e <len(x)]])
Using groupby and nlargest, we get the index values of those largest dates. Then we use .loc to get just those rows.
df.loc[df.groupby('userid').date.nlargest(12).index.get_level_values(1)]
Consider the dataframe df
import numpy as np
import pandas as pd

dates = pd.date_range('2015-08-08', periods=10)
df = pd.DataFrame(dict(
    userid=np.arange(2).repeat(4),
    date=np.random.choice(dates, 8, False)
))
print(df)
date userid
0 2015-08-12 0 # <-- keep
1 2015-08-09 0
2 2015-08-11 0
3 2015-08-15 0 # <-- keep
4 2015-08-13 1
5 2015-08-10 1
6 2015-08-17 1 # <-- keep
7 2015-08-16 1 # <-- keep
We'll keep the latest 2 dates per user id
df.loc[df.groupby('userid').date.nlargest(2).index.get_level_values(1)]
date userid
0 2015-08-12 0
3 2015-08-15 0
6 2015-08-17 1
7 2015-08-16 1
In case someone is interested, I solved my second problem like this:
I cast the date to an int, calculated the difference, and shifted the userid by one row, like in my example. Then the following (found a solution on Stack Overflow):
gr_ob = df.groupby('userid')
gr_dict = gr_ob.groups
final_df = pd.DataFrame(columns=['userid', 'date', 'consum'])
for group_name in gr_dict.keys():
    new_df = gr_ob.get_group(group_name)
    if (new_df['userid'].iloc[0] != new_df['id_before'].iloc[0]) & (new_df['YearMonthDiff'].iloc[1:] == -1.0).all() & (len(new_df) == 12):
        final_df = final_df.append(new_df[['userid', 'date', 'consum']])
I am running a for loop over each of the 12 months. For each month I get a bunch of dates in random order over various years in history. I also have corresponding temperature data on those dates, e.g. if the loop is on January, all dates and temperatures I get from history are for January only.
I want to start with an empty pandas dataframe with two columns, namely 'Dates' and 'Temperature'. As the loop progresses, I want to add the dates from each month to the 'Dates' column and the corresponding data to the 'Temperature' column.
After my dataframe is ready, I want to finally use the 'Dates' column as the index to order the 'Temperature' history, so that I have the historical dates correctly sorted with their temperatures.
I have thought about using numpy and storing the dates and data in two separate arrays, sorting the dates and then sorting the temperatures using some kind of index, but I believe it would be better implemented in pandas, perhaps using its pivot table feature.
@Zanam Please refer to this syntax. I think your question is similar to this answer:
from random import randint

from pandas import DataFrame

df = DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
    df.loc[i] = [randint(-1,1) for n in range(3)]

print(df)
lib qty1 qty2
0 0 0 -1
1 -1 -1 1
2 1 -1 1
3 0 0 0
4 1 -1 -1
[5 rows x 3 columns]
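A hypothetical adaptation to the Dates/Temperature case (the column names and the small sample history below are assumptions, not data from the question): append one row at a time as in the example above, then set 'Dates' as the index and sort at the end.
import pandas as pd

# Assumed sample history: month -> list of (date, temperature) pairs
history = {
    'January': [('2001-01-05', 3.1), ('2000-01-12', -2.0)],
    'February': [('2001-02-03', 4.5), ('2000-02-20', 1.2)],
}

df = pd.DataFrame(columns=('Dates', 'Temperature'))
for month, records in history.items():
    for date, temp in records:
        # Append one row at a time, as in the example above
        df.loc[len(df)] = [date, temp]

# Finally, use 'Dates' as the index and sort chronologically
df['Dates'] = pd.to_datetime(df['Dates'])
df = df.set_index('Dates').sort_index()
print(df)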