Pandas: Drop rows with values less than today using np.where()? - python

Given the following dataset, with the current week being 2019/W37, how do I drop the rows that fall before the current week using np.where?
Year Week Value
0 2019 31 10
1 2019 32 20
2 2019 33 30
3 2019 34 40
4 2019 35 50
5 2019 36 60
6 2019 37 70
7 2019 38 80
8 2019 39 90
9 2019 40 100
I tried the following:
import pandas as pd
import numpy as np
from datetime import datetime
data = {
"Year": [2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019],
"Week": [31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
"Value": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
}
df = pd.DataFrame(data)
print(df)
YearWeek = datetime.now().strftime("%Y/W%V")
print(YearWeek)
df["Exclude"] = np.where(str(df["Year"] + "/" + df["Week"]) < YearWeek, "Yes", "No")
print(df)

Try this:
df_new = df[pd.to_datetime(df["Year"].astype(str) + "/W" + df["Week"].astype(str), format="%Y/W%V", errors='ignore') >= YearWeek]
or using np.where()
df.iloc[np.where(pd.to_datetime((df["Year"].astype(str) + "/W" + df["Week"].astype(str)), format="%Y/W%V", errors='ignore') >= YearWeek )]
To generate the exclude column:
df['exclude'] = np.where(pd.to_datetime((df["Year"].astype(str) + "/W" + df["Week"].astype(str)), format="%Y/W%V", errors='ignore') < YearWeek, 'Yes', 'No' )
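A caveat: "%Y/W%V" is not a format to_datetime can normally parse (the ISO week %V wants the ISO year %G plus a weekday), so with errors='ignore' the strings appear to come back unparsed and the comparison is plain string comparison, which works here only because both sides are fixed-width "YYYY/Www" strings. A minimal sketch that avoids date parsing entirely, comparing (year, week) pairs from isocalendar():
import numpy as np
from datetime import datetime

iso = datetime.now().isocalendar()  # (ISO year, ISO week, weekday)
before = (df["Year"] < iso[0]) | ((df["Year"] == iso[0]) & (df["Week"] < iso[1]))
df["Exclude"] = np.where(before, "Yes", "No")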

>>> print(df)
Year Week Value
0 2019 31 10
1 2019 32 20
2 2019 33 30
3 2019 34 40
4 2019 35 50
5 2019 36 60
6 2019 37 70
7 2019 38 80
8 2019 39 90
9 2019 40 100
>>> today = pd.to_datetime('today')
>>> today
Timestamp('2019-09-12 22:54:46.039542')
>>> df[(df.Week < today.week) | (df.Year < today.year)]
Year Week Value
0 2019 31 10
1 2019 32 20
2 2019 33 30
3 2019 34 40
4 2019 35 50
5 2019 36 60
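Note that this selects the rows before the current week, i.e. the rows to drop. The week test can also misfire on data spanning several years (a low week number from a later year would still match). A sketch of a year-aware mask that keeps only the current and future weeks:
mask = (df.Year > today.year) | ((df.Year == today.year) & (df.Week >= today.week))
df_kept = df[mask]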

You can use a decimal week system:
w = df['Year'] + df['Week'] / 54
now = pd.Timestamp.now()
this_week = now.year + now.week / 54
df[w >= this_week]
Result
Year Week Value
6 2019 37 70
7 2019 38 80
8 2019 39 90
9 2019 40 100
In the ISO date system a year can have up to 53 weeks, so we divide by 54 to prevent the last week of year N from comparing as larger than the start of year N+1. Any divisor greater than 53 works just as well; it's simply a way to combine the year and the week into a single, comparable quantity.

We can do
df[(df.Year*100+df.Week)<int(pd.to_datetime('today').strftime('%Y%W'))]
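This likewise selects the rows before the current week (the ones to drop); flip the comparison to >= to keep the current and future weeks. One caveat: %W numbers weeks from the first Monday of the year, which is not the same as ISO %V numbering, so the two schemes can disagree around year boundaries:
df_kept = df[(df.Year*100 + df.Week) >= int(pd.to_datetime('today').strftime('%Y%W'))]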

Related

Select date columns in python based on specific date criteria

This is my sample code. My database contains columns for every date of the year, going back multiple years. Each column corresponds to a specific date.
import pandas as pd
df = pd.DataFrame([[10, 5, 25, 67, 25, 56],
                   [20, 10, 26, 45, 56, 34],
                   [30, 3, 27, 34, 78, 34],
                   [40, 9, 28, 45, 34, 76]],
                  columns=[pd.to_datetime('2022-09-14'), pd.to_datetime('2022-08-14'),
                           pd.to_datetime('2022-07-14'), pd.to_datetime('2021-09-14'),
                           pd.to_datetime('2020-09-14'), pd.to_datetime('2019-09-14')])
Is there a way to select only those columns which fit particular criteria based on year, month, or quarter?
For example, I was hoping to get only those columns with the same day and month as today (or any starting date), for every year. For example, today is Sep 14, 2022 and I need columns only for Sep 14, 2021, Sep 14, 2020 and so on. Another option could be to do the same on a month or quarter basis.
How can this be done in pandas?
Yes, you can do:
# day
df.loc[:, df.columns.day == 14]
2022-09-14 2022-08-14 2022-07-14 2021-09-14 2020-09-14 2019-09-14
0 10 5 25 67 25 56
1 20 10 26 45 56 34
2 30 3 27 34 78 34
3 40 9 28 45 34 76
# month
df.loc[:, df.columns.month == 9]
2022-09-14 2021-09-14 2020-09-14 2019-09-14
0 10 67 25 56
1 20 45 56 34
2 30 34 78 34
3 40 45 34 76
# quarter
df.loc[:, df.columns.quarter == 3]
2022-09-14 2022-08-14 2022-07-14 2021-09-14 2020-09-14 2019-09-14
0 10 5 25 67 25 56
1 20 10 26 45 56 34
2 30 3 27 34 78 34
3 40 9 28 45 34 76
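If you'd rather tie the selection to today's date than hard-code 14 or 9, the same idea works with pd.Timestamp.today(), since the column labels form a DatetimeIndex (a sketch under that assumption):
import pandas as pd

today = pd.Timestamp.today()
# columns sharing today's day and month, across all years
df.loc[:, (df.columns.day == today.day) & (df.columns.month == today.month)]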

Concat two arrays under a specific condition?

I need to concatenate two arrays of unequal size:
Array-1:
A = ["year","month","day","hour","minute","second", "a", "b", "c", "d"]
data1 = pd.read_csv('event_5.txt',sep='\t',names=A)
array1=data1[['year', 'month', 'day']]
Array-2:
B=["station", "phase", "hour", "minute", "second"]
arr_data = pd.read_csv('arrival_5.txt',sep='\t',names=B)
ar_t= arr_data[['hour', 'minute', 'second']]
array2 = pd.DataFrame(ar_t)
The required output is shown below: here, [2019 11 9] is array-1, reshaped to match the row count of the second array and then concatenated. With manual reshaping, however, I have to check the dimensions of the second array every time, so I need an automated script that can concatenate the unequal arrays.
Array-1: The first array always has the same dimensions:
year month day
0 2019 11 9
Array-2: Variable dimensions; the columns are fixed but the rows change on each iteration:
hour minute second
0 14 57 41.80
1 14 58 3.47
2 14 57 25.99
3 14 57 37.00
4 14 57 29.86
5 14 57 40.24
6 14 57 32.61
7 14 57 42.26
8 14 57 29.74
9 14 57 42.36
10 14 57 46.00
11 14 58 8.69
12 14 57 34.50
13 14 57 48.97
14 14 57 30.30
15 14 57 39.78
16 14 57 32.45
17 14 57 47.83
18 14 57 25.86
19 14 57 36.30
20 14 57 17.90
21 14 57 23.40
22 14 57 34.64
23 14 57 50.95
24 14 57 35.90
25 14 57 50.64
Required output:
Year month day hour minute second
0 2019 11 9 14 57 41.80
1 2019 11 9 14 58 3.47
2 2019 11 9 14 57 25.99
3 2019 11 9 14 57 37.00
4 2019 11 9 14 57 29.86
5 2019 11 9 14 57 40.24
6 2019 11 9 14 57 32.61
7 2019 11 9 14 57 42.26
8 2019 11 9 14 57 29.74
9 2019 11 9 14 57 42.36
10 2019 11 9 14 57 46.00
11 2019 11 9 14 58 8.69
12 2019 11 9 14 57 34.50
13 2019 11 9 14 57 48.97
14 2019 11 9 14 57 30.30
15 2019 11 9 14 57 39.78
16 2019 11 9 14 57 32.45
17 2019 11 9 14 57 47.83
18 2019 11 9 14 57 25.86
19 2019 11 9 14 57 36.30
20 2019 11 9 14 57 17.90
21 2019 11 9 14 57 23.40
22 2019 11 9 14 57 34.64
23 2019 11 9 14 57 50.95
24 2019 11 9 14 57 35.90
25 2019 11 9 14 57 50.64
Assigning a constant value to a DataFrame column
If your first array is always a single-row DataFrame, or a one-dimensional array, then you can just use pandas to assign a constant value to a column.
The syntax is my_dataframe["new_column"] = constant_value.
Because arr1 is a DataFrame, accessing a column gives us a Series. To get its constant value, then, we take the value in the cell indexed by 0, i.e. the first row.
In your case this becomes:
>>> type(arr1), type(arr2)
(pandas.core.frame.DataFrame, pandas.core.frame.DataFrame)
>>> arr2["year"] = arr1["year"].loc[0]
>>> arr2["month"] = arr1["month"].loc[0]
>>> arr2["day"] = arr1["day"].loc[0]
>>> arr2
hours minutes seconds year month day
0 9 6 22.001464 2019 11 9
1 8 21 28.412044 2019 11 9
2 10 7 22.433552 2019 11 9
3 18 37 19.551359 2019 11 9
4 19 1 40.722019 2019 11 9
.. ... ... ... ... ... ...
95 2 16 48.368643 2019 11 9
96 19 22 25.034936 2019 11 9
97 10 0 20.163870 2019 11 9
98 16 35 27.251357 2019 11 9
99 8 26 54.200897 2019 11 9
Remember that this works in place, modifying the arr2 object.
Accessing the numpy array behind the DataFrame
If you need the multidimensional array, you can just call:
>>> arr2_np = arr2.to_numpy()
Sorting columns based on your use-case
If you need to sort the columns, you can just take a different view of them, like this:
>>> cols = arr2.columns.to_list()
>>> cols2 = cols[3:] + cols[:3]
>>> arr2[cols2]
year month day hours minutes seconds
0 2019 11 9 9 6 22.001464
1 2019 11 9 8 21 28.412044
2 2019 11 9 10 7 22.433552
3 2019 11 9 18 37 19.551359
4 2019 11 9 19 1 40.722019
.. ... ... ... ... ... ...
95 2019 11 9 2 16 48.368643
96 2019 11 9 19 22 25.034936
97 2019 11 9 10 0 20.163870
98 2019 11 9 16 35 27.251357
99 2019 11 9 8 26 54.200897
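An equivalent two-liner, assuming arr1 is always a single-row DataFrame: DataFrame.assign broadcasts scalar values across every row, so you can unpack arr1's first row and then reorder:
out = arr2.assign(**arr1.iloc[0])                   # adds year/month/day as constant columns
out = out[list(arr1.columns) + list(arr2.columns)]  # put the date columns first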
This worked for me:
import numpy as np
arr1=[2019, 12, 17]
arr2=[12, 34, 17,
18, 17, 36,
15, 23, 40]
print(arr1,arr2)
output:
[2019, 12, 17] [12, 34, 17, 18, 17, 36, 15, 23, 40]
arr2 = np.array(arr2).reshape((3,3))
arr1 = np.array([arr1,]*3)
newArray = np.hstack((arr1,arr2))
output:
array([[2019, 12, 17, 12, 34, 17],
[2019, 12, 17, 18, 17, 36],
[2019, 12, 17, 15, 23, 40]])
Update: to improve performance on large datasets, simply stack each new entry onto the array once it has been reshaped:
arr1=[2019, 12, 17]
newEntry = [1,2,3]
nE = np.hstack((arr1,newEntry))
np.vstack((newArray,nE))
output:
array([[2019, 12, 17, 12, 34, 17],
[2019, 12, 17, 18, 17, 36],
[2019, 12, 17, 15, 23, 40],
[2019, 12, 17, 1, 2, 3]])
Update: without knowing the exact dimensions in advance, you can simply reshape with an inferred row count:
np.array(arr2).reshape(-1, 3)
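To make the repetition automatic as well, np.tile can build the date block from the reshaped array's row count (a sketch with the same hard-coded date):
import numpy as np

arr2 = np.array(arr2).reshape(-1, 3)                # row count inferred from the data length
arr1 = np.tile([2019, 12, 17], (arr2.shape[0], 1))  # one copy of the date row per data row
newArray = np.hstack((arr1, arr2))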
You can use numpy.column_stack:
np.column_stack((array_1, array_2))
a
#array([[0, 1, 2],
# [3, 4, 5]])
b
#array([0, 1])
np.column_stack((a, b))
#array([[0, 1, 2, 0],
# [3, 4, 5, 1]])
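Note that column_stack does not broadcast: b must already have one entry per row of a. For the single-date case in this question, repeat the date row first (reusing the a from the example above):
date_row = np.array([2019, 11, 9])
np.column_stack((np.tile(date_row, (a.shape[0], 1)), a))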

How to shift a single value of a pandas dataframe column

Using pandas first_valid_index() to get the index of the first non-null value of a column, how can I shift a single value of the column rather than the whole column? I.e.:
data = {'year': [2010, 2011, 2012, 2013, 2014, 2015, 2016,2017, 2018, 2019],
'columnA': [10, 21, 20, 10, 39, 30, 31,45, 23, 56],
'columnB': [None, None, None, 10, 39, 30, 31,45, 23, 56],
'total': [100, 200, 300, 400, 500, 600, 700,800, 900, 1000]}
df = pd.DataFrame(data)
df = df.set_index('year')
print(df)
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 10 400
2014 39 39 500
2015 30 30 600
2016 31 31 700
2017 45 45 800
2018 23 23 900
2019 56 56 1000
for col in df.columns:
    if col not in ['total']:
        idx = df[col].first_valid_index()
        df.loc[idx, col] = df.loc[idx, col] + df.loc[idx, 'total'].shift(1)
print(df)
AttributeError: 'numpy.float64' object has no attribute 'shift'
desired result:
print(df)
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 310 400
2014 39 39 500
2015 30 30 600
2016 31 31 700
2017 45 45 800
2018 23 23 900
2019 56 56 1000
Is that what you want?
In [63]: idx = df.columnB.first_valid_index()
In [64]: df.loc[idx, 'columnB'] += df.total.shift().loc[idx]
In [65]: df
Out[65]:
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 310.0 400
2014 39 39.0 500
2015 30 30.0 600
2016 31 31.0 700
2017 45 45.0 800
2018 23 23.0 900
2019 56 56.0 1000
UPDATE: starting from Pandas 0.20.1 the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.
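If you'd rather not shift the whole total column just to read one value, positional indexing fetches the previous row's total directly (a sketch, assuming the first valid row is not the very first row of the frame):
pos = df.index.get_loc(idx)                  # integer position of the first valid row
df.loc[idx, 'columnB'] += df['total'].iloc[pos - 1]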
You can skip the total column together with all columns that contain no NaN values by taking their union; the loop then touches only the columns with at least one NaN:
for col in df.columns:
    if col not in pd.Index(['total']).union(df.columns[~df.isnull().any()]):
        idx = df[col].first_valid_index()
        df.loc[idx, col] += df.total.shift().loc[idx]
print(df)
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 310.0 400
2014 39 39.0 500
2015 30 30.0 600
2016 31 31.0 700
2017 45 45.0 800
2018 23 23.0 900
2019 56 56.0 1000

Split one column of a csv file based on another column

I am trying to split a csv file of temperature data into smaller dictionaries so I can calculate the mean temperature of each month. The csv file is of the format below:
AirTemperature AirHumidity SoilTemperature SoilMoisture LightIntensity WindSpeed Year Month Day Hour Minute Second TimeStamp MonthCategorical
12 68 19 65 60 2 2016 1 1 0 1 1 10100 January
18 34 14 42 19 0 2016 1 1 1 1 1 10101 January
19 98 14 41 30 4 2016 1 1 2 1 1 10102 January
16 88 16 68 54 4 2016 1 1 3 1 1 10103 January
16 44 20 41 10 1 2016 1 1 4 1 1 10104 January
22 54 18 65 94 0 2016 1 1 5 1 1 10105 January
18 84 17 41 40 4 2016 1 1 6 1 1 10106 January
20 88 22 92 31 0 2016 1 1 7 1 1 10107 January
23 1 22 59 3 0 2016 1 1 8 1 1 10108 January
23 3 22 72 41 4 2016 1 1 9 1 1 10109 January
24 63 23 83 85 0 2016 1 1 10 1 1 10110 January
29 73 27 50 1 4 2016 1 1 11 1 1 10111 January
28 37 30 46 29 3 2016 1 1 12 1 1 10112 January
30 99 32 78 73 4 2016 1 1 13 1 1 10113 January
32 72 31 80 80 1 2016 1 1 14 1 1 10114 January
There are 24 readings per day over a 6-month period.
I can get halfway there with the following code:
for row in df['AirTemperature']:
    for equivalentRow in df['MonthCategorical']:
        if equivalentRow == "January":
            JanuaryAirTemperatures.append(row)
But the output of this has every AirTemp value duplicated by the number of rows containing the value January; i.e., rather than 12, 18, 19, etc., it goes 12, 12, 12, 12, 12, 18, 18, 18, 18, 18, 19, 19, 19, 19.
I tried the following:
for row in df['AirTemperature']:
    if df['MonthCategorical'] == "January":
        JanuaryAirTemperatures.append(row)
But I get the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I assume because it is trying to look at the whole column rather than the equivalent row.
IIUC, you can group by month and get the mean AirTemperature per month with:
g = df.groupby('MonthCategorical')['AirTemperature'].mean().reset_index(name='MeanAirTemperature')
this returns:
MonthCategorical MeanAirTemperature
0 January 22
Then you can choose on what columns you want to groupby (i.e. instead of MonthCategorical you can group by Month only...).
EDIT:
You can also use transform to get a new column to append to the original dataframe with:
df['MeanAirTemperature'] = df.groupby('MonthCategorical')['AirTemperature'].transform('mean')
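And if you do want the January temperatures as a plain list, as the original loop intended, a boolean mask avoids iterating over row pairs entirely:
JanuaryAirTemperatures = df.loc[df['MonthCategorical'] == 'January', 'AirTemperature'].tolist()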

pandas: conditionally return a column's value

I am trying to make a new column called 'wage_rate' that fills in the appropriate wage rate for the employee based on the year of the observation.
In other words, my list looks something like this:
eecode year w2011 w2012 w2013
1 2012 7 8 9
1 2013 7 8 9
2 2011 20 25 25
2 2012 20 25 25
2 2013 20 25 25
And I want it to return, in a new column, 8 for the first row, 9 for the second, then 20, 25, 25.
One way would be to use apply, constructing the column name for each row based on the year, like 'w' + str(x.year).
In [41]: df.apply(lambda x: x['w' + str(x.year)], axis=1)
Out[41]:
0 8
1 9
2 20
3 25
4 25
dtype: int64
Details:
In [42]: df
Out[42]:
eecode year w2011 w2012 w2013
0 1 2012 7 8 9
1 1 2013 7 8 9
2 2 2011 20 25 25
3 2 2012 20 25 25
4 2 2013 20 25 25
In [43]: df['wage_rate'] = df.apply(lambda x: x['w' + str(x.year)], axis=1)
In [44]: df
Out[44]:
eecode year w2011 w2012 w2013 wage_rate
0 1 2012 7 8 9 8
1 1 2013 7 8 9 9
2 2 2011 20 25 25 20
3 2 2012 20 25 25 25
4 2 2013 20 25 25 25
values = [ row['w%s'% row['year']] for key, row in df.iterrows() ]
df['wage_rate'] = values # create the new columns
This solution uses an explicit loop, so it is likely slower than pure-pandas solutions, but on the other hand it is simple and readable.
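A vectorized alternative, sketched with numpy positional indexing; get_indexer maps each row's 'w<year>' label to a column position, assuming every year in the data has a matching column:
import numpy as np

col_pos = df.columns.get_indexer('w' + df['year'].astype(str))  # one column position per row
df['wage_rate'] = df.to_numpy()[np.arange(len(df)), col_pos]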
You can also rename the year columns to bare years using re.sub:
In [70]: import re
    ...: df.columns = [re.sub(r'w(?=\d{4}$)', '', column) for column in df.columns]
In [80]:
df.columns
Out[80]:
Index([u'eecode', u'year', u'2011', u'2012', u'2013', u'wage_rate'], dtype='object')
then get the value using the following
df['wage_rate'] = df.apply(lambda x: x[str(x.year)], axis=1)
Out[79]:
eecode year 2011 2012 2013 wage_rate
1 2012 7 8 9 8
1 2013 7 8 9 9
2 2011 20 25 25 20
2 2012 20 25 25 25
2 2013 20 25 25 25
