I have a dataframe (df_1) that contains unordered coordinate and value data and looks like this:
   x_grid  y_grid  n_value
0   204.0    32.0       45
1   204.0    33.0       32
2   204.0    34.0       94
3   204.0    35.0       92
4   204.0    36.0       84
I wanted to reshape it into another dataframe (df_2) to be able to create a heatmap, so I created an empty dataframe whose column indexes are the x_grid values and whose row indexes are the y_grid values. Then, in a for loop, I checked whether the row index equals the x_grid value and, if so, set the column with the index of the y_grid value to the n_value.
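For reference, the df_2 / index_list setup was roughly along these lines (a minimal sketch, not shown in the original post; index_list is assumed to hold the grid values used as row labels, matching the comparison in the loop below):
import pandas as pd
# Assumed setup: row labels from one grid axis, column labels from the other
index_list = sorted(df_1['x_grid'].unique())
column_list = sorted(df_1['y_grid'].unique())
df_2 = pd.DataFrame(0, index=index_list, columns=column_list)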
Here is my code:
for i, row in enumerate(df_2.iterrows()):
    row_ind = index_list[i]
    for j, item in enumerate(df_1.iterrows()):
        x_ind = item[1].x_grid
        if row_ind == x_ind:
            col_ind = item[1].y_grid
            row[1].col_ind = item[1].n_value
When I run this loop I see new values filling the dataframe, but it does not seem right: the coordinates and values in the second dataframe do not match those in the first one.
The second dataframe (df_2) partially looks something like this:
      0  25  26  27
0     0   0  27   0
195   0   0  32  36
196   0  65   0   0
197   0   0   0  24
198   0  73  58   0
Is there a better way to perform this? I would also appreciate any other methods for turning the initial dataframe into a heatmap.
IIUC:
df_2 = df_1.pivot(index='y_grid', columns='x_grid', values='n_value') \
           .reindex(index=pd.RangeIndex(0, df_1['y_grid'].max() + 1),
                    columns=pd.RangeIndex(0, df_1['x_grid'].max() + 1),
                    fill_value=0)
If you have duplicated values for the same (x, y), use pivot_table:
df_2 = df_1.pivot_table(values='n_value', index='y_grid', columns='x_grid', aggfunc='mean') \
           .reindex(index=pd.RangeIndex(df_1['y_grid'].min(), df_1['y_grid'].max() + 1),
                    columns=pd.RangeIndex(df_1['x_grid'].min(), df_1['x_grid'].max() + 1))
Example:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(2022)
df_1 = pd.DataFrame(np.random.randint(0, 20, (1000, 3)),
                    columns=['x_grid', 'y_grid', 'n_value'])
df_2 = df_1.pivot_table(values='n_value', index='y_grid', columns='x_grid', aggfunc='mean') \
           .reindex(index=pd.RangeIndex(df_1['y_grid'].min(), df_1['y_grid'].max() + 1),
                    columns=pd.RangeIndex(df_1['x_grid'].min(), df_1['x_grid'].max() + 1))
sns.heatmap(df_2, vmin=0, vmax=df_1['n_value'].max())
plt.show()
(I think) I'm looking to apply a transform to a column by finding the % change compared to one static column.
My first attempt looks like this (without a transform):
from pandas import DataFrame
from numpy import random
df = DataFrame(random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
print(df)
for col in df.columns:
    # Find the % increase/decrease of "col" compared to column A
    df[col] = df[["A", col]].pct_change(axis=1)[col]
print(df)
...however the resulting df is all NaNs, when I'm expecting it to be in % increase/decrease format.
So as an example, it starts by comparing column A with column A; that's fine, all values SHOULD be the same. Then on the next iteration it should be column B compared to column A, and we should see %'s in column B at the end. Then the same for C and D. I'm just new to transforms/changing values of a column in place and not sure how to do it.
Subtract column A from the dataframe then divide by column A to calculate pct_change:
df.sub(df['A'], axis=0).div(df['A'], axis=0)
The above expression can be further simplified to:
df.div(df['A'], axis=0).sub(1)
A B C D
0 0.0 -0.821429 1.535714 0.500000
1 0.0 0.491525 0.508475 -0.745763
2 0.0 -0.452055 0.013699 -0.452055
3 0.0 2.187500 0.062500 0.812500
4 0.0 -0.632184 -0.839080 0.114943
5 0.0 -0.042105 -0.378947 -0.157895
6 0.0 -0.553191 -0.734043 -0.319149
...
98 0.0 -0.604651 -0.325581 -0.418605
99 0.0 0.649123 -0.964912 -0.631579
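To see that the two expressions agree, here is a quick check on a fresh random frame (a minimal sketch; it starts the integers at 1 only to avoid dividing by zero):
import numpy as np
from pandas import DataFrame
from numpy import random
df = DataFrame(random.randint(1, 100, size=(100, 4)), columns=list('ABCD'))
out1 = df.sub(df['A'], axis=0).div(df['A'], axis=0)   # (x - A) / A
out2 = df.div(df['A'], axis=0).sub(1)                 # x / A - 1
print(np.allclose(out1, out2))                        # True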
I think the problem is that you cannot have two columns 'A' in df[["A", col]]. When you change df.columns to df.columns[1:], it runs without errors.
for col in df.columns[1:]:
    # Find the % increase/decrease of "col" compared to column A
    df[col] = df[["A", col]].pct_change(axis=1)[col]
print(df)
Result:
A B C D
0 33 -0.974288 1.757576 0.575758
1 74 -1.010044 -0.945946 -0.797297
2 62 -1.015869 0.064516 -0.145161
3 53 -0.998932 0.377358 0.075472
4 97 -1.010203 -0.948454 -0.340206
.. .. ... ... ...
95 88 -0.998838 -0.102273 -0.704545
96 59 -1.009193 -0.983051 -0.525424
97 52 -1.011464 0.134615 -0.903846
98 0 inf inf inf
99 33 -0.979798 -0.181818 -0.303030
This is a piece of my code, but I don't know how to add this array as a new column, Height, to the original csv file, in this format:
Date        Level  Height
01-01-2021     45       0
02-01-2021     43       0
03-01-2021     47       1
04-01-2021     46       0
.....
import pandas as pd
from scipy.signal import find_peaks
import matplotlib.pyplot as plt
bestand = pd.read_csv('file.csv', skiprows=1, usecols=[0, 1], names=['date', 'level'])
bestand = bestand['level']
indices = find_peaks(bestand, height=37, threshold=None, distance=None)
height = indices[1]['peak_heights']
print(height)
I think you want to assign a column named height that takes the value 1 when level is a peak according to find_peaks(). If so:
# Declare column full of zeros
bestand['height'] = 0
# Get row number of observations that are peaks
idx = find_peaks(x=bestand['level'], height=37)[0]
# Select rows in `idx` and replace their `height` with 1
bestand.iloc[idx, 2] = 1
Which returns this:
date level height
0 01-01-2021 45 0
1 02-01-2021 43 0
2 03-01-2021 47 1
3 04-01-2021 46 0
I'm not sure I understood your question.
You just want to save the results?
bestand['Height'] = indices
bestand.to_csv('file.csv')
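If the goal is to write the peak information back next to the original rows, one possible sketch (this assumes that intent, and keeps the original two-column frame instead of overwriting bestand with the 'level' Series):
import pandas as pd
from scipy.signal import find_peaks
bestand = pd.read_csv('file.csv', skiprows=1, usecols=[0, 1], names=['date', 'level'])
peaks, props = find_peaks(bestand['level'], height=37)
bestand['Height'] = 0                            # default: not a peak
bestand.loc[bestand.index[peaks], 'Height'] = 1  # or props['peak_heights'] to store the heights
bestand.to_csv('file.csv', index=False)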
Overview
How do you populate a pandas dataframe using math which uses column and row indices as variables?
Setup
import pandas as pd
import numpy as np
df = pd.DataFrame(index = range(5), columns = ['Combo_Class0', 'Combo_Class1', 'Combo_Class2', 'Combo_Class3', 'Combo_Class4'])
Objective
Each cell in df = row index * (column index + 2)
Attempt 1
You can use this solution to produce the following code:
row = 0
for i in range(5):
    row = row + 1
    df.loc[i] = [(row)*(1+2), (row)*(2+2), (row)*(3+2), (row)*(4+2), (row)*(5+2)]
Attempt 2
This solution seemed relevant as well, although I believe I've read you're not supposed to loop through dataframes. Besides, I'm not seeing how to loop through rows and columns:
for i, j in df.iterrows():
    df.loc[i] = i
You can leverage broadcasting for a more efficient approach:
ix = (df.index + 1).to_numpy()  # use .values for pandas < 0.24
df[:] = ix[:, None] * (ix + 2)
print(df)
Combo_Class0 Combo_Class1 Combo_Class2 Combo_Class3 Combo_Class4
0 3 4 5 6 7
1 6 8 10 12 14
2 9 12 15 18 21
3 12 16 20 24 28
4 15 20 25 30 35
Using np.multiply.outer:
df[:] = np.multiply.outer((np.arange(5) + 1), (np.arange(5) + 3))
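A quick way to sanity-check either approach against the stated objective (with both the row and column index treated as 1-based, as in the output above; a small sketch):
import numpy as np
expected = np.array([[(r + 1) * ((c + 1) + 2) for c in range(5)] for r in range(5)])
print(np.array_equal(df.to_numpy(), expected))   # True if df was filled as above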
I have a column with 100 rows and I want to generate multiple columns (say 100) from this column. These new columns should be generated by multiplying the first column with a random value. Is there a way to do it using Python? I have tried it in Excel, but that is a tedious task, as for every column I have to multiply the column by a randomly generated number (RANDBETWEEN(a, b)).
Let's assume you have a column of numeric data:
import numpy as np
import pandas as pd
import random
# random.randint(a,b) will choose a random integer between a and b
# this will create a column that is 96 elements long
col = [random.randint(0,500) for i in range(96)]
Now, let's create more columns by leveraging a numpy.array which supports scalar multiplication of vectors:
arr = np.array(col)
# our dataframe has one column in it
df = pd.DataFrame(arr, columns=['x'])
a, b = 100, 5000 # set what interval to select random numbers from
Now, you can loop through to add in new columns
num_cols = 99
for i in range(num_cols):  # or however many columns you want to add
    df[i] = df.x * random.randint(a, b)
df.head()
x 0 1 2 3 4 5 6 ... 92 93 94 95 96 97 98 99
0 68 257040 214268 107576 266152 229568 309468 319668 ... 74460 25024 85952 320620 331840 175712 87788 254864
1 286 1081080 901186 452452 1119404 965536 1301586 1344486 ... 313170 105248 361504 1348490 1395680 739024 369226 1071928
2 421 1591380 1326571 666022 1647794 1421296 1915971 1979121 ... 460995 154928 532144 1985015 2054480 1087864 543511 1577908
3 13 49140 40963 20566 50882 43888 59163 61113 ... 14235 4784 16432 61295 63440 33592 16783 48724
4 344 1300320 1083944 544208 1346416 1161344 1565544 1617144 ... 376680 126592 434816 1621960 1678720 888896 444104 1289312
[5 rows x 101 columns]
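If you prefer to avoid the explicit loop, the same kind of table can be built in one step with an outer product (a sketch under the same assumptions; note that np.random.randint excludes the upper bound, hence the b + 1):
import numpy as np
import pandas as pd
import random
a, b = 100, 5000
col = [random.randint(0, 500) for i in range(96)]   # base column, as above
factors = np.random.randint(a, b + 1, size=100)      # one random factor per new column
wide = pd.DataFrame(np.multiply.outer(np.array(col), factors))
wide.insert(0, 'x', col)                              # keep the original column in front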
You can use NumPy broadcasting with reshape to multiply the column by random numbers:
a, b = 10, 20
df = pd.DataFrame({'col':np.random.randint(0,500, 100)})
df['col'].values * np.random.randint(a, b, 100).reshape(-1,1)
To get the result in a DataFrame:
pd.DataFrame(df['col'].values * np.random.randint(a, b, 100).reshape(-1,1))
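One thing to double-check with that broadcast: a (100,) column times a (100, 1) array gives a 100 x 100 result whose rows, not columns, are the scaled copies of col. If you want each new column to be col times a single random value, transpose the result (a small adjustment, assuming that is the intended layout):
pd.DataFrame((df['col'].values * np.random.randint(a, b, 100).reshape(-1, 1)).T)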
If I have a dataframe, say:
df = {'carx' : ['merc','rari','merc','hond','fia','merc'],
      'cary' : ['bent','maz','ben','merc','fia','fia'],
      'milesx' : [0,100,2,22,5,6],
      'milesy' : [10,3,18,2,19,2]}
I would then like to plot the value from column milesx if the corresponding index of carx has the value 'merc'. The same criteria apply for cary and milesy; otherwise nothing should be plotted. How can I do this?
milesy and milesx should be plotted on the x-axis. The y-axis should just be some continuous values (1, 2, ...).
IIUC, assuming you have the following dataframe:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# input dictionary
df = {'carx' : ['merc','rari','merc','hond','fia','merc'],
      'cary' : ['bent','maz','ben','merc','fia','fia'],
      'milesx' : [0,100,2,22,5,6],
      'milesy' : [10,3,18,2,19,2]}
# creating input dataframe
dataframe = pd.DataFrame(df)
print(dataframe)
Result:
carx cary milesx milesy
0 merc bent 0 10
1 rari maz 100 3
2 merc ben 2 18
3 hond merc 22 2
4 fia fia 5 19
5 merc fia 6 2
Then, you want to plot values given a condition, which can be done by writing a function and using apply:
def my_function(row):
    if row['carx'] == 'merc':
        return row['milesx']
    if row['cary'] == 'merc':
        return row['milesy']
    return None
# filter those with only 'merc'
filtered = dataframe.apply(lambda row: my_function(row), axis=1)
print(filtered)
Result:
0 0.0
1 NaN
2 2.0
3 2.0
4 NaN
5 6.0
dtype: float64
You do not want to plot rows where neither column is 'merc' (those are NaN), so dropna() may be used:
# plotting
filtered.dropna().plot(kind='bar', legend=None);
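If you specifically want the mile values on the x-axis with just a running counter on the y-axis, as described in the question, a small sketch using the same filtered series:
vals = filtered.dropna()
plt.scatter(vals.values, range(1, len(vals) + 1))   # miles on x, arbitrary running value on y
plt.xlabel('miles')
plt.show()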