Iterate over rows of Pandas dataframe - python

I have a df like this:
CELLID lon lat METER LATITUDE_SM LONGITUDE_SM Path_ID
2557709 5.286339 51.353820 E0047000004028217 51.3501 5.3125 2557709_E0047000004028217
For each Path_ID (str) I would like to iterate over the rows and produce frames like these:
Path_ID METER LATITUDE_SM LONGITUDE_SM
2557709_E0047000004028217 E0047000004028217 51.3501 5.3125
Path_ID CELLID Lon lat
2557709_E0047000004028217 2557709 5.286339 51.353820
I have many rows in the df.
I am doing something like this:
for index, row in df.iterrows():
    print(row['Path_ID'], row['METER'], row['LATITUDE_SM'], row['LONGITUDE_SM'])

It is hard to understand your goal, but IIUC, you want to group by Path_ID and print each group:
grouped_df = df.groupby("Path_ID")[["Path_ID", "METER", "LATITUDE_SM", "LONGITUDE_SM"]]
for key, val in grouped_df:
    print(grouped_df.get_group(key), "\n")
Output
Path_ID METER LATITUDE_SM LONGITUDE_SM
0 2557709_E0047000004028217 E0047000004028217 51.35 5.3125
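If the goal is the two per-Path_ID frames shown in the question rather than printed groups, a minimal sketch using plain column selection (column names taken from the question's sample df) would be:
# two views keyed by Path_ID, matching the desired layouts above
df1 = df[['Path_ID', 'METER', 'LATITUDE_SM', 'LONGITUDE_SM']]
df2 = df[['Path_ID', 'CELLID', 'lon', 'lat']]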

It is unclear why you want this behaviour, but you can achieve this with pd.DataFrame.iloc.
If you only need specific columns, replace : with a list of column numbers.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.random((5, 5)))
for i in range(len(df.index)):
    print(df.iloc[[i], :])
# 0 1 2 3 4
# 0 0.587349 0.947435 0.974285 0.498303 0.135898
# 0 1 2 3 4
# 1 0.292748 0.880276 0.522478 0.081902 0.187494
# 0 1 2 3 4
# 2 0.692022 0.908397 0.200202 0.099722 0.348589
# 0 1 2 3 4
# 3 0.041564 0.980425 0.899634 0.725757 0.569983
# 0 1 2 3 4
# 4 0.787038 0.000077 0.213646 0.444095 0.022923
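As a concrete instance of the column-number variant mentioned above (columns 0 and 2 are an arbitrary choice for illustration):
# same loop, but only columns 0 and 2 of each row
for i in range(len(df.index)):
    print(df.iloc[[i], [0, 2]])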


How to identify and highlight outliers in each row of a pandas dataframe

I want to do the following to my dataframe:
For each row identify outliers/anomalies
Highlight/color the identified outliers' cells (preferably in red)
Count the number of identified outliers in each row (store in a column 'anomaly_count')
Export the output as an xlsx file
See below for sample data
import numpy as np
import pandas as pd

np.random.seed([5, 1591])
df = pd.DataFrame(
    np.random.normal(size=(16, 5)),
    columns=list('ABCDE')
)
df
A B C D E
0 -1.685112 -0.432143 0.876200 1.626578 1.512677
1 0.401134 0.439393 1.027222 0.036267 -0.655949
2 -0.074890 0.312793 -0.236165 0.660909 0.074468
3 0.842169 2.759467 0.223652 0.432631 -0.484871
4 -0.619873 -1.738938 -0.054074 0.337663 0.358380
5 0.083653 0.792835 -0.643204 1.182606 -1.207692
6 -1.168773 -1.456870 -0.707450 -0.439400 0.319728
7 2.316974 -0.177750 1.289067 -2.472729 -1.310188
8 2.354769 1.099483 -0.653342 -0.532208 0.269307
9 0.431649 0.666982 0.361765 0.419482 0.531072
10 -0.124268 -0.170720 -0.979012 -0.410861 1.000371
11 -0.392863 0.933516 -0.502608 -0.759474 -1.364289
12 1.405442 -0.297977 0.477609 -0.046791 -0.126504
13 -0.711799 -1.042558 -0.970183 -1.672715 -0.524283
14 0.029966 -0.579152 0.648176 0.833141 -0.942752
15 0.824767 0.974580 0.363170 0.428062 -0.232174
The desired outcome should look something like this:
## I want to ONLY identify the outliers, NOT remove or substitute them. I used NaN only to depict the outlier values. Ideally, the outlier cells should be colored/highlighted red.
## Please note: the NaN outliers in the sample are randomly assigned.
A B C D E Anomaly_Count
0 NaN -0.432143 0.876200 NaN 1.512677 2
1 0.401134 0.439393 1.027222 0.036267 -0.655949 0
2 -0.074890 0.312793 -0.236165 0.660909 0.074468 0
3 0.842169 NaN 0.223652 0.432631 -0.484871 1
4 -0.619873 -1.738938 -0.054074 0.337663 0.358380 0
5 0.083653 0.792835 -0.643204 NaN NaN 2
6 -1.168773 -1.456870 -0.707450 -0.439400 0.319728 0
7 2.316974 -0.177750 1.289067 -2.472729 -1.310188 0
8 2.354769 1.099483 -0.653342 -0.532208 0.269307 0
9 0.431649 0.666982 0.361765 0.419482 0.531072 0
10 -0.124268 -0.170720 -0.979012 -0.410861 NaN 1
11 -0.392863 0.933516 -0.502608 -0.759474 -1.364289 0
12 1.405442 -0.297977 0.477609 -0.046791 -0.126504 0
13 -0.711799 -1.042558 -0.970183 -1.672715 -0.524283 0
14 0.029966 -0.579152 0.648176 0.833141 -0.942752 0
15 0.824767 NaN 0.363170 0.428062 -0.232174 1
See below for my attempt; I am open to other approaches.
import numpy as np
import pandas as pd
from scipy import stats

def outlier_detection(df):
    # step I: identify the outliers in each row
    df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]  # unfortunately this removes the outliers, which I don't want
    # step II: color/highlight the outlier cells
    df = df.style.highlight_null('red')
    # step III: count the number of outliers in each row
    df['Anomaly_count'] = df.isnull().sum(axis=1)
    # step IV: export as an xlsx file
    df.to_excel(r'Path to store the exported excel file\File Name.xlsx', sheet_name='Your sheet name', index=False)

outlier_detection(df)
Thanks for your time.
This works for me
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed([5, 1591])
df = pd.DataFrame(
    np.random.normal(size=(16, 5)),
    columns=list('ABCDE')
)

mask = pd.DataFrame(abs(stats.zscore(df)) > 1, columns=df.columns)
df["Count"] = mask.sum(axis=1)
mask["Count"] = False
style_df = mask.applymap(lambda x: "background-color: red" if x else "")

sheet_name = "Values"
with pd.ExcelWriter("score_test.xlsx", engine="openpyxl") as writer:
    df.style.apply(lambda x: style_df, axis=None).to_excel(
        writer, sheet_name=sheet_name, index=False)
Here mask is the boolean condition: True wherever the z-score exceeds the limit. Based on this boolean mask I create a string dataframe style_df with the value 'background-color: red' on the deviating cells. The last statement applies the values of style_df to the style of the df dataframe.
The resulting Excel file now has the deviating cells highlighted in red.
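One caveat: stats.zscore defaults to axis=0 (column-wise), while the question asks for outliers within each row. A minimal row-wise variant of the mask above, assuming the same df and using the question's threshold of 3, would be:
import numpy as np
import pandas as pd
from scipy import stats

# z-scores computed within each row (axis=1) rather than per column
mask = pd.DataFrame(np.abs(stats.zscore(df, axis=1)) > 3, columns=df.columns)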

python stacking data with missing values in the header

I have data that is imported from a csv file, in reality there are more columns and more cycles, but this is a representative snippet:
Export date 2020-10-10
Record #3 Record #2 Record #1
Cycle #5 Cycle #4 Cycle #3
time ( min.) Parameter1 Something2 Whatever3 Parameter1 Something2 Whatever3 Parameter1 Something2 Whatever3
0 0.0390625 9.89619 0.853909 14.409 10.1961 0.859037 14.4676 10.0274 0.832598
1 0.0390625 9.53452 0.949844 14.4096 10.3034 1.224 14.4676 10.0323 1.20403
2 0.0390625 9.8956 1.47227 14.4097 10.6586 1.14486 14.4676 10.4936 1.12747
3 0.0390625 10.7829 1.44412 14.4097 10.9185 1.20247 14.5116 10.6892 1.12459
The top part of the data contains a row (export date) that is not needed in the table.
I would like to stack the data so that there will be Cycle and Record columns. The problem is that these values are found only above the first column of data for every cycle. For example, Cycle5 has three columns of data, then Cycle4 has three columns of data etc.
This is how the output should look:
I didn't get very far:
df = pd.read_csv('cycles.csv')
# Fill the names of cycles to the right
df.ffill(axis=1, inplace=True)
# Not sure this is needed; it might make it easier to melt/stack
df.iloc[0, 0] = "time ( min.)"
df.iloc[1, 0] = "time ( min.)"
Thank you for your thoughts and assistance!!
There are a couple of problems with this, all of which you need to address:
First, read all the required info:
This cannot be done unless each piece of info is read separately:
import pandas as pd
from io import StringIO

string = open('SO.csv').read()
records = [i.split('#')[1].strip() for i in string.split('\n')[1].split(',') if '#' in i]
cycles = [i.split('#')[1].strip() for i in string.split('\n')[2].split(',') if '#' in i]
data = pd.read_csv(StringIO(string), sep=',', header=3).dropna(how='any')
Rename columns so they follow a pattern:
cols = [col for col in data.columns if '.' not in col]
data = data.rename(columns=dict(zip(cols, [col + '.0' for col in cols])))
Build a loop to pluck out the columns for each record and cycle:
dfs = []
for rdx, rec in enumerate(records):
    sel = ['time ( min.)'] + [col for col in data.columns if col.endswith(str(rdx))]
    df = data[sel].rename(columns=dict(zip([col + f'.{rdx}' for col in cols], cols)))
    df[['Cycle', 'Record']] = cycles[rdx], records[rdx]
    dfs.append(df)
Finally, merge them all:
pd.concat(dfs)
This results in:
time ( min.) Parameter1 Something2 Whatever3 Cycle Record
0 0.0 0.039062 9.89619 0.853909 5 3
1 1.0 0.039062 9.53452 0.949844 5 3
2 2.0 0.039062 9.89560 1.472270 5 3
3 3.0 0.039062 10.78290 1.444120 5 3
0 0.0 14.409000 10.19610 0.859037 4 2
1 1.0 14.409600 10.30340 1.224000 4 2
2 2.0 14.409700 10.65860 1.144860 4 2
3 3.0 14.409700 10.91850 1.202470 4 2
0 0.0 14.467600 10.02740 0.832598 3 1
1 1.0 14.467600 10.03230 1.204030 3 1
2 2.0 14.467600 10.49360 1.127470 3 1
3 3.0 14.511600 10.68920 1.124590 3 1
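For what it's worth, a hedged alternative sketch that avoids the per-column renaming: read the file with no header, forward-fill the sparse Record/Cycle rows sideways, and slice out each three-column block. This assumes the same 'SO.csv' layout with exactly three data columns per cycle:
import pandas as pd

raw = pd.read_csv('SO.csv', header=None, skiprows=1)
records = raw.iloc[0].ffill()   # 'Record #...' repeated across its block
cycles = raw.iloc[1].ffill()    # 'Cycle #...' repeated across its block
names = raw.iloc[2]             # the real column names
body = raw.iloc[3:].reset_index(drop=True)

dfs = []
for start in range(1, raw.shape[1], 3):   # three data columns per block
    block = body.iloc[:, [0] + list(range(start, start + 3))].copy()
    block.columns = [names[0]] + list(names[start:start + 3])
    block = block.apply(pd.to_numeric)    # everything below the headers is numeric
    block['Cycle'] = cycles[start].split('#')[1].strip()
    block['Record'] = records[start].split('#')[1].strip()
    dfs.append(block)

out = pd.concat(dfs, ignore_index=True)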
Breaking a problem down into simple steps will help you not only here but in every other case: figure out what you need to do, break it into steps, and work through them!

Select multiple columns within a certain range of another column

1 0 0 0.579322
2 0 0 0.579306
3 0 0 0.279274
4 5 0 0.579224
5 3 0 0.579157
3 0 0 0.47907
7 0 1 0.378963
8 9 0 0.578833
I'm a beginner in Python and struggling to do this. I have four columns as shown above; I need to save columns 1, 2 and 3 for the rows whose value in column 4 is greater than 0.4 and less than 0.5. Can this be done via numpy?
This is the code I tried.
import csv

csv_out = csv.writer(open('data_new.csv', 'w'), delimiter=',')
f = open('coordiantes.txt')
for line in f:
    vals = line.split('\t')
    if 0.4 <= float(vals[3]) <= 0.5:
        print(vals[0], vals[1], vals[2])
        csv_out.writerow([vals[0], vals[1], vals[2], vals[3]])
f.close()
This can be done with a few built-in NumPy functions:
import numpy as np

vals = ...  # your array
# Boolean-index the array where the fourth column meets your criteria
vals = vals[np.where((vals[:, 3] <= 0.5) & (vals[:, 3] > 0.4))]
# use numpy to slice off the last column and to save the file
np.savetxt('coordiantes.txt', vals[:, :3], delimiter=',')
You can do the following:
import numpy as np
data = np.loadtxt('coordinates.txt')
idx = np.where((data[:,3] <= 0.5) & (data[:,3] > 0.4))[0] # save where col 4's data is in (0.4,0.5]
selected_data = data[idx,:3] # get the 1st three cols for the rows of interest
np.savetxt('data_new.csv', selected_data, delimiter=',')
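If pandas is an option, an equivalent sketch (assuming whitespace- or tab-separated input with no header row; filenames follow the question):
import pandas as pd

df = pd.read_csv('coordiantes.txt', sep=r'\s+', header=None)
# keep rows whose 4th value lies in (0.4, 0.5], then write the first three columns
keep = (df[3] > 0.4) & (df[3] <= 0.5)
df.loc[keep, [0, 1, 2]].to_csv('data_new.csv', index=False, header=False)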

How do I convert a row from a pandas DataFrame from a Series back to a DataFrame?

I am iterating through the rows of a pandas DataFrame, expanding each one out into N rows with additional info on each one (for simplicity I've made it a random number here):
from pandas import DataFrame
import pandas as pd
from numpy import random, arange

N = 3
x = DataFrame.from_dict({'farm': ['A', 'B', 'A', 'B'],
                         'fruit': ['apple', 'apple', 'pear', 'pear']})
out = DataFrame()
for i, row in x.iterrows():
    rows = pd.concat([row] * N).reset_index(drop=True)  # requires row to be a DataFrame
    out = out.append(rows.join(DataFrame({'iter': arange(N), 'value': random.uniform(size=N)})))
In this loop, row is a Series object, so the call to pd.concat doesn't work as intended. How do I convert it to a DataFrame? (E.g. what is the difference between x.ix[0:0] and x.ix[0]?)
Thanks!
Given what you commented, I would try
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

results = x.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
This should give you a separate results dataframe. I have assumed that every farm-fruit combination is unique; there might be other ways if we knew more about your data.
Update
Running code example
import pandas as pd
from numpy import random, arange

def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

N = 3
df = pd.DataFrame(arange(0, 8).reshape(4, 2), columns=['low', 'high'])
df['farm'] = 'a'
df['fruit'] = arange(0, 4)
results = df.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
df
low high farm fruit
0 0 1 a 0
1 2 3 a 1
2 4 5 a 2
3 6 7 a 3
results
farm fruit
a 0 [0.176124290969, 0.459726835079, 0.999564934689]
1 [2.42920143009, 2.37484506501, 2.41474002256]
2 [4.78918572452, 4.25916442343, 4.77440617104]
3 [6.53831891152, 6.23242754976, 6.75141668088]
If instead you want a dataframe, you can update the function to:
def giveMeSomeRows(group):
    return pd.DataFrame(random.uniform(low=group.low, high=group.high, size=N))
results
0
farm fruit
a 0 0 0.281088
1 0.020348
2 0.986269
1 0 2.642676
1 2.194996
2 2.650600
2 0 4.545718
1 4.486054
2 4.027336
3 0 6.550892
1 6.363941
2 6.702316
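As for the literal question in the title: a Series can be turned back into a one-row DataFrame directly, which also makes the pd.concat call in the original loop work. A minimal sketch using the question's x:
for i, row in x.iterrows():
    row_df = row.to_frame().T   # one-column frame, transposed back into a single row
    print(row_df)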

Pandas DataFrame group by value and get column & row indexes

I have a pandas DataFrame like the following.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 5), columns=['1', '2', '3', '4', '5'])
1 2 3 4 5
0 0.877455 -1.215212 -0.453038 -1.825135 0.440646
1 1.640132 -0.031353 1.159319 -0.615796 0.763137
2 0.132355 -0.762932 -0.909496 -1.012265 -0.695623
3 -0.257547 -0.844019 0.143689 -2.079521 0.796985
4 2.536062 -0.730392 1.830385 0.694539 -0.654924
I need to get row and column indexes for the following three groups (in my original dataset there are no negative values):
value is greater than 2.0
value is between 1.0 - 2.0
value is less than 1.0
For example, for "value is greater than 2.0" it should return [1,4]. I have tried this, which gives a boolean result:
df.values > 2
You can use np.where on the boolean result to extract the indices:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 5), columns=['1', '2', '3', '4', '5'])
condition = df.values > 2
print(np.column_stack(np.where(condition)))
For a df like this,
1 2 3 4 5
0 0.057347 0.722251 0.263292 -0.168865 -0.111831
1 -0.765375 1.040659 0.272883 -0.834273 -0.126997
2 -0.023589 0.046002 1.206445 0.381532 -1.219399
3 2.290187 2.362249 -0.748805 -1.217048 -0.973749
4 0.100084 0.671120 -0.211070 0.903264 -0.312815
Output:
[[3 0]
[3 1]]
Or get a list of row-column index pairs if necessary:
print(list(map(list, np.column_stack(np.where(condition)))))
Output:
[[3,0], [3,1]]
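The other two groups from the question follow the same pattern; a short sketch (whether 1.0 and 2.0 themselves are included is a choice the question leaves open):
# value between 1.0 and 2.0 (inclusive here), and value less than 1.0
between = np.column_stack(np.where((df.values >= 1.0) & (df.values <= 2.0)))
below = np.column_stack(np.where(df.values < 1.0))
print(between)
print(below)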
