This is my sample code. My database contains a column for every date of the year, going back multiple years; each column corresponds to a specific date.
import pandas as pd
df = pd.DataFrame(
    [[10, 5, 25, 67, 25, 56],
     [20, 10, 26, 45, 56, 34],
     [30, 3, 27, 34, 78, 34],
     [40, 9, 28, 45, 34, 76]],
    columns=pd.to_datetime(['2022-09-14', '2022-08-14', '2022-07-14',
                            '2021-09-14', '2020-09-14', '2019-09-14']),
)
Is there a way to select only those columns that fit a particular criterion based on year, month, or quarter?
For example, I was hoping to get only those columns that have the same day and month as today (any starting date) for every year. For example, today is Sep 14, 2022, and I need the columns for Sep 14, 2021, Sep 14, 2020, and so on. Another option could be to do the same on a month or quarter basis.
How can this be done in pandas?
Yes, you can do:
# day
df.loc[:, df.columns.day == 14]
2022-09-14 2022-08-14 2022-07-14 2021-09-14 2020-09-14 2019-09-14
0 10 5 25 67 25 56
1 20 10 26 45 56 34
2 30 3 27 34 78 34
3 40 9 28 45 34 76
# month
df.loc[:, df.columns.month == 9]
2022-09-14 2021-09-14 2020-09-14 2019-09-14
0 10 67 25 56
1 20 45 56 34
2 30 34 78 34
3 40 45 34 76
# quarter
df.loc[:, df.columns.quarter == 3]
2022-09-14 2022-08-14 2022-07-14 2021-09-14 2020-09-14 2019-09-14
0 10 5 25 67 25 56
1 20 10 26 45 56 34
2 30 3 27 34 78 34
3 40 9 28 45 34 76
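The day and month filters can also be combined to get exactly the "same date every year" selection from the question; a minimal sketch on the sample frame, using a fixed target date as a stand-in for pd.Timestamp.today():

```python
import pandas as pd

df = pd.DataFrame(
    [[10, 5, 25, 67, 25, 56],
     [20, 10, 26, 45, 56, 34]],
    columns=pd.to_datetime(['2022-09-14', '2022-08-14', '2022-07-14',
                            '2021-09-14', '2020-09-14', '2019-09-14']),
)

# keep only the columns whose month AND day both match the target date
target = pd.Timestamp('2022-09-14')  # stand-in for pd.Timestamp.today()
mask = (df.columns.month == target.month) & (df.columns.day == target.day)
result = df.loc[:, mask]
print(result.columns)
```

This keeps Sep 14 of every year and drops the Aug 14 and Jul 14 columns.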
I am trying to run a nested loop in which I want the output to be saved in four different columns. Let C1R1 be the value I want in the first column, first row; C2R2 the one I want in the second column, second row; etc. What I have come up with so far gives me a list where the output is saved like this:
['C1R1', 'C2R1', 'C3R1', 'C4R1']. This is the code I am using:
dfs1 = []
for i in range(24):
    row = data_json2['data']['Rows'][i]  # renamed from 'pd' to avoid shadowing the pandas import
    for j in range(4):
        dfs1.append(row['Columns'][j]['Value'])
What could be a good way to achieve this?
EDIT: This is what I want to achieve:
Column 1 Column 2 Column 3 Column 4
0 0 24 48 72
1 1 25 49 73
2 2 26 50 74
3 3 27 51 75
4 4 28 52 76
5 5 29 53 77
6 6 30 54 78
7 7 31 55 79
8 8 32 56 80
9 9 33 57 81
10 10 34 58 82
11 11 35 59 83
12 12 36 60 84
13 13 37 61 85
14 14 38 62 86
15 15 39 63 87
16 16 40 64 88
17 17 41 65 89
18 18 42 66 90
19 19 43 67 91
20 20 44 68 92
21 21 45 69 93
22 22 46 70 94
23 23 47 71 95
While this is what I got now:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
Thank you.
Try:
import pandas as pd
def get_dataframe(num_cols=4, num_values=24):
    return pd.DataFrame(
        ([v * 24 + c for v in range(num_cols)] for c in range(num_values)),
        columns=[f"Column {c}" for c in range(1, num_cols + 1)],
    )
df = get_dataframe()
print(df)
Prints:
Column 1 Column 2 Column 3 Column 4
0 0 24 48 72
1 1 25 49 73
2 2 26 50 74
3 3 27 51 75
4 4 28 52 76
5 5 29 53 77
6 6 30 54 78
7 7 31 55 79
8 8 32 56 80
9 9 33 57 81
10 10 34 58 82
11 11 35 59 83
12 12 36 60 84
13 13 37 61 85
14 14 38 62 86
15 15 39 63 87
16 16 40 64 88
17 17 41 65 89
18 18 42 66 90
19 19 43 67 91
20 20 44 68 92
21 21 45 69 93
22 22 46 70 94
23 23 47 71 95
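Alternatively, if you already have the flat list from the nested loop, you can reshape it directly instead of regenerating the data; a sketch assuming, as in the desired output above, that the first 24 values belong to Column 1 (the dfs1 name is taken from the question):

```python
import numpy as np
import pandas as pd

# stand-in for the flat list built by the nested loop (dfs1 in the question)
dfs1 = list(range(96))  # 24 rows x 4 columns

# order='F' fills column-by-column, so the first 24 values become Column 1
df = pd.DataFrame(
    np.array(dfs1).reshape(24, 4, order='F'),
    columns=[f"Column {c}" for c in range(1, 5)],
)
print(df.head())
```

If the list is actually filled row-by-row instead, drop the order='F' argument.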
Economy  Year  Indicator1  Indicator2  Indicator3  Indicator4  ...
UK       1     23          45          56          78
UK       2     24          87          32          42
UK       3     22          87          32          42
UK       4     2           87          32          42
FR       ...   ...         ...         ...         ...
This is my data, which extends further and is held as a DataFrame. I want to swap the header (the indicators) and the Year column, which seems like a pivot. There are hundreds of indicators and 20 years.
Use DataFrame.melt with DataFrame.pivot (keyword arguments for pivot are required in pandas 2.0+):
df = (df.melt(['Economy', 'Year'], var_name='Ind')
        .pivot(index=['Economy', 'Ind'], columns='Year', values='value')
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
Economy Ind 1 2 3 4
0 UK Indicator1 23 24 22 2
1 UK Indicator2 45 87 87 87
2 UK Indicator3 56 32 32 32
3 UK Indicator4 78 42 42 42
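The same reshape can also be written as a stack/unstack chain; a minimal sketch on the question's sample data:

```python
import pandas as pd

df = pd.DataFrame(
    columns=['Economy', 'Year', 'Indicator1', 'Indicator2', 'Indicator3', 'Indicator4'],
    data=[['UK', 1, 23, 45, 56, 78], ['UK', 2, 24, 87, 32, 42],
          ['UK', 3, 22, 87, 32, 42], ['UK', 4, 2, 87, 32, 42]],
)

# stack the indicator columns into the index, then move Year out to the columns
out = df.set_index(['Economy', 'Year']).stack().unstack('Year')
print(out)
```

This yields one row per (Economy, Indicator) pair with one column per year.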
Another option is to set Year column as index and then use transpose.
Consider the code below:
import pandas as pd

df = pd.DataFrame(
    columns=['Economy', 'Year', 'Indicator1', 'Indicator2', 'Indicator3', 'Indicator4'],
    data=[['UK', 1, 23, 45, 56, 78], ['UK', 2, 24, 87, 32, 42],
          ['UK', 3, 22, 87, 32, 42], ['UK', 4, 2, 87, 32, 42],
          ['FR', 1, 22, 33, 11, 35]],
)
# Make Year column as index
df = df.set_index('Year')
# Transpose columns to rows and vice-versa
df = df.transpose()
print(df)
gives you
Year 1 2 3 4 1
Economy UK UK UK UK FR
Indicator1 23 24 22 2 22
Indicator2 45 87 87 87 33
Indicator3 56 32 32 32 11
Indicator4 78 42 42 42 35
You can use transpose, like this:
df = df.set_index('Year')
df = df.transpose()
print (df)
I need to concatenate two arrays of unequal size:
Array-1:
A = ["year","month","day","hour","minute","second", "a", "b", "c", "d"]
data1 = pd.read_csv('event_5.txt',sep='\t',names=A)
array1=data1[['year', 'month', 'day']]
Array-2:
B=["station", "phase", "hour", "minute", "second"]
arr_data = pd.read_csv('arrival_5.txt',sep='\t',names=B)
ar_t= arr_data[['hour', 'minute', 'second']]
array2 = pd.DataFrame(ar_t)
The required output is shown below: here, [2019 11 9] is array-1, reshaped to match the dimensions of the second array and then concatenated. However, when reshaping I need to check the dimensions of the second array every time, so I need an automated script that can concatenate unequal arrays.
Array-1: The first array always has the same dimensions:
year month day
0 2019 11 9
Array-2: Variable dimensions; the columns are fixed, but the rows change on each iteration:
hour minute second
0 14 57 41.80
1 14 58 3.47
2 14 57 25.99
3 14 57 37.00
4 14 57 29.86
5 14 57 40.24
6 14 57 32.61
7 14 57 42.26
8 14 57 29.74
9 14 57 42.36
10 14 57 46.00
11 14 58 8.69
12 14 57 34.50
13 14 57 48.97
14 14 57 30.30
15 14 57 39.78
16 14 57 32.45
17 14 57 47.83
18 14 57 25.86
19 14 57 36.30
20 14 57 17.90
21 14 57 23.40
22 14 57 34.64
23 14 57 50.95
24 14 57 35.90
25 14 57 50.64
Required output:
Year month day hour minute second
0 2019 11 9 14 57 41.80
1 2019 11 9 14 58 3.47
2 2019 11 9 14 57 25.99
3 2019 11 9 14 57 37.00
4 2019 11 9 14 57 29.86
5 2019 11 9 14 57 40.24
6 2019 11 9 14 57 32.61
7 2019 11 9 14 57 42.26
8 2019 11 9 14 57 29.74
9 2019 11 9 14 57 42.36
10 2019 11 9 14 57 46.00
11 2019 11 9 14 58 8.69
12 2019 11 9 14 57 34.50
13 2019 11 9 14 57 48.97
14 2019 11 9 14 57 30.30
15 2019 11 9 14 57 39.78
16 2019 11 9 14 57 32.45
17 2019 11 9 14 57 47.83
18 2019 11 9 14 57 25.86
19 2019 11 9 14 57 36.30
20 2019 11 9 14 57 17.90
21 2019 11 9 14 57 23.40
22 2019 11 9 14 57 34.64
23 2019 11 9 14 57 50.95
24 2019 11 9 14 57 35.90
25 2019 11 9 14 57 50.64
Assigning a constant value to a DataFrame column
If your first array is always a single-row dataframe, or a monodimensional array, then you can just use pandas to assign a constant value to a column.
The syntax is my_dataframe["new_column"] = constant_value.
Because arr1 is a DataFrame, accessing a column gives us a Series. To get its constant value, we then take the value in the cell indexed by 0, i.e. the first row.
In your case this becomes:
>>> type(arr1), type(arr2)
(pandas.core.frame.DataFrame, pandas.core.frame.DataFrame)
>>> arr2["year"] = arr1["year"].loc[0]
>>> arr2["month"] = arr1["month"].loc[0]
>>> arr2["day"] = arr1["day"].loc[0]
>>> arr2
hours minutes seconds year month day
0 9 6 22.001464 2019 11 9
1 8 21 28.412044 2019 11 9
2 10 7 22.433552 2019 11 9
3 18 37 19.551359 2019 11 9
4 19 1 40.722019 2019 11 9
.. ... ... ... ... ... ...
95 2 16 48.368643 2019 11 9
96 19 22 25.034936 2019 11 9
97 10 0 20.163870 2019 11 9
98 16 35 27.251357 2019 11 9
99 8 26 54.200897 2019 11 9
Remember that this works in place, modifying the arr2 object.
Accessing the numpy array behind the DataFrame
If you need the multidimensional array, you can just call:
>>> arr2_np = arr2.to_numpy()
Sorting columns based on your use-case
If you need to sort the columns, you can just take a different view of them, like this:
>>> cols = arr2.columns.to_list()
>>> cols2 = cols[3:] + cols[:3]
>>> arr2[cols2]
year month day hours minutes seconds
0 2019 11 9 9 6 22.001464
1 2019 11 9 8 21 28.412044
2 2019 11 9 10 7 22.433552
3 2019 11 9 18 37 19.551359
4 2019 11 9 19 1 40.722019
.. ... ... ... ... ... ...
95 2019 11 9 2 16 48.368643
96 2019 11 9 19 22 25.034936
97 2019 11 9 10 0 20.163870
98 2019 11 9 16 35 27.251357
99 2019 11 9 8 26 54.200897
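If you prefer a single step, the three assignments above can be expressed as a cross join (available since pandas 1.2); a sketch with made-up values standing in for the question's files:

```python
import pandas as pd

# single-row date frame (array-1) and variable-length arrival frame (array-2)
arr1 = pd.DataFrame({'year': [2019], 'month': [11], 'day': [9]})
arr2 = pd.DataFrame({'hour': [14, 14, 14],
                     'minute': [57, 58, 57],
                     'second': [41.80, 3.47, 25.99]})

# every row of arr2 is paired with the single row of arr1
out = arr1.merge(arr2, how='cross')
print(out)
```

Because arr1 is on the left, the date columns also come out first, matching the required output.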
This worked for me:
import numpy as np
arr1 = [2019, 12, 17]
arr2 = [12, 34, 17,
        18, 17, 36,
        15, 23, 40]
print(arr1,arr2)
output:
[2019, 12, 17] [12, 34, 17, 18, 17, 36, 15, 23, 40]
arr2 = np.array(arr2).reshape((3,3))
arr1 = np.array([arr1,]*3)
newArray = np.hstack((arr1,arr2))
output:
array([[2019, 12, 17, 12, 34, 17],
[2019, 12, 17, 18, 17, 36],
[2019, 12, 17, 15, 23, 40]])
Update: to improve performance on large datasets, simply stack each new value onto the array after it has been reshaped once:
arr1=[2019, 12, 17]
newEntry = [1,2,3]
nE = np.hstack((arr1,newEntry))
np.vstack((newArray,nE))
output:
array([[2019, 12, 17, 12, 34, 17],
[2019, 12, 17, 18, 17, 36],
[2019, 12, 17, 15, 23, 40],
[2019, 12, 17, 1, 2, 3]])
Update: without knowing the exact dimensions, you can simply use:
np.array(arr2).reshape(-1, 3)
You can use numpy.column_stack:
np.column_stack((array_1, array_2))
a
#array([[0, 1, 2],
# [3, 4, 5]])
b
#array([0, 1])
np.column_stack((a, b))
#array([[0, 1, 2, 0],
# [3, 4, 5, 1]])
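Note that column_stack appends a 1-D array as a single column, so for the question's case the date row first has to be repeated to match arr2's row count; a minimal sketch:

```python
import numpy as np

arr1 = np.array([2019, 11, 9])          # the fixed date row
arr2 = np.array([[14, 57, 41.80],
                 [14, 58, 3.47],
                 [14, 57, 25.99]])      # variable number of rows

# repeat the date row once per row of arr2, then stack side by side
out = np.column_stack((np.broadcast_to(arr1, (arr2.shape[0], arr1.size)), arr2))
print(out)
```

broadcast_to creates a read-only view, so no copy of the date row is made until column_stack assembles the result.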
I'm after a pythonic and pandemic (from pandas, pun not intended =) way to pivot some rows in a dataframe into new columns.
My data has this format:
dof foo bar qux
idxA idxB
100 101 1 10 30 50
101 2 11 31 51
101 3 12 32 52
102 1 13 33 53
102 2 14 34 54
102 3 15 35 55
200 101 1 16 36 56
101 2 17 37 57
101 3 18 38 58
102 1 19 39 59
102 2 20 40 60
102 3 21 41 61
The variables foo, bar and qux actually have 3 dimensional coordinates, which I would like to call foo1, foo2, foo3, bar1, ..., qux3. These are identified by the column dof. Each row represents one axis in 3D, dof == 1 is the x axis, dof == 2 the y axis and dof == 3 is the z axis.
So, here is the final dataframe I want:
foo1 bar1 qux1 foo2 bar2 qux2 foo3 bar3 qux3
idxA idxB
100 101 10 30 50 11 31 51 12 32 52
102 13 33 53 14 34 54 15 35 55
200 101 16 36 56 17 37 57 18 38 58
102 19 39 59 20 40 60 21 41 61
Question is: what is the best way to do that?
Here is what I have done.
A code to re-create my dataset in a dataframe:
import pandas as pd
data = [[100, 101, 1, 10, 30, 50],
[100, 101, 2, 11, 31, 51],
[100, 101, 3, 12, 32, 52],
[100, 102, 1, 13, 33, 53],
[100, 102, 2, 14, 34, 54],
[100, 102, 3, 15, 35, 55],
[200, 101, 1, 16, 36, 56],
[200, 101, 2, 17, 37, 57],
[200, 101, 3, 18, 38, 58],
[200, 102, 1, 19, 39, 59],
[200, 102, 2, 20, 40, 60],
[200, 102, 3, 21, 41, 61],
]
df = pd.DataFrame(data=data, columns=['idxA', 'idxB', 'dof', 'foo', 'bar', 'qux'])
df.set_index(['idxA', 'idxB'], inplace=True)
Code to do what I would like to do:
df2 = df[df.dof == 1].reset_index()[['idxA', 'idxB']]
df2.set_index(['idxA', 'idxB'], inplace=True)

for pivot in [1, 2, 3]:
    df2.loc[:, 'foo%d' % pivot] = df[df.dof == pivot]['foo']
    df2.loc[:, 'bar%d' % pivot] = df[df.dof == pivot]['bar']
    df2.loc[:, 'qux%d' % pivot] = df[df.dof == pivot]['qux']
However I'm not too happy with these .loc calls and incremental column additions to the dataframe. I thought that pandas being awesome as it is would have a neater way of doing that. A one-liner would be super cool.
You can try df.pivot:
df = df.pivot(columns='dof')
foo bar qux
dof 1 2 3 1 2 3 1 2 3
idxA idxB
100 101 10 11 12 30 31 32 50 51 52
102 13 14 15 33 34 35 53 54 55
200 101 16 17 18 36 37 38 56 57 58
102 19 20 21 39 40 41 59 60 61
Now flatten the MultiIndex columns using df.columns.map:
df.columns = df.columns.map('{0[0]}{0[1]}'.format) #suggested by #YOBEN_S
foo1 foo2 foo3 bar1 bar2 bar3 qux1 qux2 qux3
idxA idxB
100 101 10 11 12 30 31 32 50 51 52
102 13 14 15 33 34 35 53 54 55
200 101 16 17 18 36 37 38 56 57 58
102 19 20 21 39 40 41 59 60 61
You can add dof to the index and then unstack:
new_df = df.set_index('dof',append=True).unstack('dof')
new_df.columns = [f'{x}{y}' for x,y in new_df.columns]
Output:
foo1 foo2 foo3 bar1 bar2 bar3 qux1 qux2 qux3
idxA idxB
100 101 10 11 12 30 31 32 50 51 52
102 13 14 15 33 34 35 53 54 55
200 101 16 17 18 36 37 38 56 57 58
102 19 20 21 39 40 41 59 60 61
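Both answers group the columns as foo1 foo2 foo3 ...; to get the foo1 bar1 qux1 ... layout shown in the question, the dof level can be sorted first (sort_remaining=False keeps the original foo/bar/qux order within each dof); a sketch on a subset of the question's data:

```python
import pandas as pd

data = [[100, 101, 1, 10, 30, 50], [100, 101, 2, 11, 31, 51],
        [100, 101, 3, 12, 32, 52], [100, 102, 1, 13, 33, 53],
        [100, 102, 2, 14, 34, 54], [100, 102, 3, 15, 35, 55]]
df = pd.DataFrame(data, columns=['idxA', 'idxB', 'dof', 'foo', 'bar', 'qux'])
df = df.set_index(['idxA', 'idxB'])

wide = df.set_index('dof', append=True).unstack('dof')
# bring dof to the outer column level and sort only that level, stably
wide = wide.swaplevel(axis=1).sort_index(axis=1, level=0, sort_remaining=False)
wide.columns = [f'{var}{dof}' for dof, var in wide.columns]
print(wide)
```

The flattening step then interleaves the variables per axis exactly as in the desired output.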
I have a dataframe, dataframe_1, that looks like this:
0 1 2 3 4 5 ... 192
0 12 35 60 78 23 90 32
And another dataframe, dataframe_2, that looks like this:
58 59 60 61 62 ... 350
0 1 4 192 4 4 1
1 0 3 3 5 3 4
2 3 1 4 2 2 192
The values in dataframe_2 are the column names from dataframe_1. What I'd like to do is change the values in dataframe_2 based on the column names of dataframe_1, like so:
58 59 60 61 62 ... 350
0 35 23 32 23 23 35
1 12 78 78 90 78 23
2 78 35 23 60 60 32
I tried a for loop using .loc, but it did not work.
Any help is greatly appreciated!
Using replace
df2.replace(dict(zip(df1.columns, df1.iloc[0])))
stack and map
# if necessary, cast,
# df1.columns = df1.columns.astype(int)
df2.stack().map(df1.iloc[0]).unstack()
58 59 60 61 62 350
0 35 23 32 23 23 35
1 12 78 78 90 78 23
2 78 35 23 60 60 32
Stack df2 so we can call Series.map to perform a single vectorised replacement using df1.
apply and map
df2.apply(pd.Series.map, args=(df1.iloc[0],))
58 59 60 61 62 350
0 35 23 32 23 23 35
1 12 78 78 90 78 23
2 78 35 23 60 60 32
Instead of stacking to get a Series, we apply a map operation across each column.
You could define a dictionary from df1 and use it to replace the values in df2:
d = dict(zip(df1.columns, df1.values.ravel()))
df2.replace(d)
58 59 60 61 62 350
0 35 23 32 23 23 35
1 12 78 78 90 78 23
2 78 35 23 60 60 32
Or stacking df1 and then replacing:
df2.replace(df1.stack().droplevel(0))
58 59 60 61 62 350
0 35 23 32 23 23 35
1 12 78 78 90 78 23
2 78 35 23 60 60 32
Create a lookup table and map the values using the underlying numpy array. This assumes integer column names.
u = np.zeros(df1.columns.max()+1, dtype=int)
u[df1.columns] = df1.iloc[0].values
u[df2.values]
array([[35, 23, 32, 23, 23, 35],
[12, 78, 78, 90, 78, 23],
[78, 35, 23, 60, 60, 32]])
If there are values that might not match a value in df1:
u = np.full(df1.columns.max()+1, np.nan)
u[df1.columns] = df1.iloc[0].values
u[df2.values]
And then fillna with df2 if desired.
You can also map each cell through the first row of df1 (note that applymap was renamed to DataFrame.map in pandas 2.1):
df2.applymap(lambda x: df1.loc[0, x])