find difference between multiple columns in a dataframe - python

I am working on the following dataframe:
0          1           2          3              4               5              6    7
new_width  new_height  new_depth  audited_Width  audited_Height  audited_Depth  inf  val
---------  ----------  ---------  -------------  --------------  -------------  ---  ---
35.00      2.00        21.00      21.00          2.50            35.00          T
12.00      4.40        10.60      11.60          4.40            12.00          T
20.50      17.00       5.50       21.50          17.05           20.50          F
24.33      22.00       18.11      24.00          22.05           24.33          T
23.00      23.00       19.00      19.00          23.00           23.00          F
Here I want to find the difference between columns (0, 3), (1, 4) and (2, 5), and check whether the difference (any one, or all three) falls in the range (0, 1). If it does, it should check the corresponding cell in column 6, and if that cell is 'T', print 'YES' in the corresponding cell of column 7.
I have the following code:
a = df['new_width'] - df['audited_Width']
for i in a:
    if (i in range(0, 1)) == True:
        df['Value'] = 'Yes'
print(df['Value'])
I know that the 4th line is incorrect. What alternatives can I use to get the desired output?

You shouldn't iterate over the rows of a DataFrame. (Note also that range(0, 1) contains only the integer 0, so i in range(0, 1) does not test whether i lies between 0 and 1.) Instead, you can create a boolean mask to select the rows that meet your condition, and then use it to fill the "val" column:
mask = (
    (df["new_width"] - df["audited_Width"]).between(0, 1)
    | (df["new_height"] - df["audited_Height"]).between(0, 1)
    | (df["new_depth"] - df["audited_Depth"]).between(0, 1)
) & (df["inf"] == "T")
df["val"] = df["val"].where(~mask, "YES")
This outputs:
new_width new_height new_depth audited_Width audited_Height audited_Depth inf val
0 35.00 2.0 21.00 21.0 2.50 35.00 T NaN
1 12.00 4.4 10.60 11.6 4.40 12.00 T YES
2 20.50 17.0 5.50 21.5 17.05 20.50 F NaN
3 24.33 22.0 18.11 24.0 22.05 24.33 T YES
4 23.00 23.0 19.00 19.0 23.00 23.00 F NaN
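For reference, here is a self-contained sketch of the same approach (the frame is rebuilt by hand from the question's sample values, which is an assumption about the real data source), using .loc assignment, which has the same effect as .where here:
import numpy as np
import pandas as pd

# Sample data copied from the question (assumed, not the real source)
df = pd.DataFrame({
    "new_width": [35.00, 12.00, 20.50, 24.33, 23.00],
    "new_height": [2.00, 4.40, 17.00, 22.00, 23.00],
    "new_depth": [21.00, 10.60, 5.50, 18.11, 19.00],
    "audited_Width": [21.00, 11.60, 21.50, 24.00, 19.00],
    "audited_Height": [2.50, 4.40, 17.05, 22.05, 23.00],
    "audited_Depth": [35.00, 12.00, 20.50, 24.33, 23.00],
    "inf": ["T", "T", "F", "T", "F"],
    "val": np.nan,
})

# Any of the three differences in [0, 1], and the flag is 'T'
mask = (
    (df["new_width"] - df["audited_Width"]).between(0, 1)
    | (df["new_height"] - df["audited_Height"]).between(0, 1)
    | (df["new_depth"] - df["audited_Depth"]).between(0, 1)
) & (df["inf"] == "T")

df.loc[mask, "val"] = "YES"  # writes only where the condition holds
print(df)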

Custom for loops are basically never the best option when it comes to pandas.
This is a method that reshapes your dataframe to an arguably better shape, performs a simple check on the new shape, and then extracts indices that should be modified in the original dataframe.
df.columns = df.columns.str.lower()
df2 = pd.wide_to_long(df.reset_index(), ['new', 'audited'], ['index'], 'values', '_', r'\w+')
hits = df2[df2.new.sub(df2.audited).between(0, 1) & df2.inf.eq('T')]  # long-format rows meeting the condition
idx = hits.reset_index('index')['index'].unique()  # original row labels to flag
df.loc[idx, 'val'] = 'YES'
print(df)
Output:
new_width new_height new_depth audited_width audited_height audited_depth inf val
0 35.00 2.0 21.00 21.0 2.50 35.00 T NaN
1 12.00 4.4 10.60 11.6 4.40 12.00 T YES
2 20.50 17.0 5.50 21.5 17.05 20.50 F NaN
3 24.33 22.0 18.11 24.0 22.05 24.33 T YES
4 23.00 23.0 19.00 19.0 23.00 23.00 F NaN
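To see why this works, it can help to inspect the intermediate long frame df2 from above: each original row contributes one row per dimension (width, height, depth), so the new/audited comparison becomes a single column subtraction.
# df2 is indexed by (index, values): the original row label plus the
# dimension name extracted from the column suffixes.
print(df2.index.names)          # ['index', 'values']
print(df2[['new', 'audited']])  # matched new/audited pairs, one dimension per row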

Related

Get the subarray with same numbers and consecutive index

I have a text file like this
0, 23.00, 78.00, 75.00, 105.00, 2,0.97
1, 371.00, 305.00, 38.00, 48.00, 0,0.85
1, 24.00, 78.00, 75.00, 116.00, 2,0.98
1, 372.00, 306.00, 37.00, 48.00, 0,0.84
2, 28.00, 87.00, 74.00, 101.00, 2,0.97
2, 372.00, 307.00, 35.00, 47.00, 0,0.80
3, 32.00, 86.00, 73.00, 98.00, 2,0.98
3, 363.00, 310.00, 34.00, 46.00, 0,0.83
4, 40.00, 77.00, 71.00, 98.00, 2,0.94
4, 370.00, 307.00, 38.00, 47.00, 0,0.84
4, 46.00, 78.00, 74.00, 116.00, 2,0.97
5, 372.00, 308.00, 34.00, 46.00, 0,0.57
5, 43.00, 66.00, 67.00, 110.00, 2,0.96
Code I tried
frames = []
x = []
y = []
labels = []
with open(file, 'r') as lb:
    for line in lb:
        line = line.replace(',', ' ')
        arr = line.split()
        frames.append(arr[0])
        x.append(arr[1])
        y.append(arr[2])
        labels.append(arr[5])
print(np.shape(frames))
for d, a in enumerate(frames):
    compare = []
    if a == frames[d+2]:
        compare.append(x[d])
        compare.append(x[d+1])
        compare.append(x[d+2])
        xm = np.argmin(compare)
        label = {0: int(labels[d]), 1: int(labels[d+1]), 2: int(labels[d+2])}.get(xm)
    elif a == frames[d+1]:
        compare.append(x[d])
        compare.append(x[d+1])
        xm = np.argmin(compare)
        label = {0: int(labels[d]), 1: int(labels[d+1])}.get(xm)
In the first line, the first number (0) is unique, so I can extract the sixth number (2) easily. But after that there are many lines with the same first number, so I want to somehow collect all the lines that share a first number, compare their second numbers, and then extract the sixth number from the line with the lowest second number.
Can someone provide a Python solution? I tried readline() and next() but don't know how to solve it.
You can read the file with pandas.read_csv instead, and things will become much easier:
import pandas as pd
df = pd.read_csv(file_path, header = None)
You'll read the file as a table
0 1 2 3 4 5 6
0 0 23.0 78.0 75.0 105.0 2 0.97
1 1 371.0 305.0 38.0 48.0 0 0.85
2 1 24.0 78.0 75.0 116.0 2 0.98
3 1 372.0 306.0 37.0 48.0 0 0.84
4 2 28.0 87.0 74.0 101.0 2 0.97
5 2 372.0 307.0 35.0 47.0 0 0.80
6 3 32.0 86.0 73.0 98.0 2 0.98
7 3 363.0 310.0 34.0 46.0 0 0.83
8 4 40.0 77.0 71.0 98.0 2 0.94
9 4 370.0 307.0 38.0 47.0 0 0.84
10 4 46.0 78.0 74.0 116.0 2 0.97
11 5 372.0 308.0 34.0 46.0 0 0.57
12 5 43.0 66.0 67.0 110.0 2 0.96
Then you can group into sub-tables based on one of the columns (in your case, column 0):
for group, sub_df in df.groupby(0):
    row = sub_df[1].idxmin()   # index of the minimum value in column 1
    print(df.loc[row, 5])      # this is the number you are looking for
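Putting it together, a runnable sketch (assuming the lines above are saved as file.txt, a hypothetical name):
import pandas as pd

df = pd.read_csv('file.txt', header=None)

labels = {}
for group, sub_df in df.groupby(0):
    row = sub_df[1].idxmin()             # row with the lowest second number in this group
    labels[int(group)] = int(df.loc[row, 5])  # its sixth number (the label)

print(labels)  # {0: 2, 1: 2, 2: 2, 3: 2, 4: 2, 5: 2} for the sample data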
I think this is what you need using pandas:
import pandas as pd
df = pd.read_table('./test.txt', sep=',', names = ('1','2','3','4','5','6','7'))
print(df)
# 1 2 3 4 5 6 7
# 0 0 23.0 78.0 75.0 105.0 2 0.97
# 1 1 371.0 305.0 38.0 48.0 0 0.85
# 2 1 24.0 78.0 75.0 116.0 2 0.98
# 3 1 372.0 306.0 37.0 48.0 0 0.84
# 4 2 28.0 87.0 74.0 101.0 2 0.97
# 5 2 372.0 307.0 35.0 47.0 0 0.80
# 6 3 32.0 86.0 73.0 98.0 2 0.98
# 7 3 363.0 310.0 34.0 46.0 0 0.83
# 8 4 40.0 77.0 71.0 98.0 2 0.94
# 9 4 370.0 307.0 38.0 47.0 0 0.84
# 10 4 46.0 78.0 74.0 116.0 2 0.97
# 11 5 372.0 308.0 34.0 46.0 0 0.57
# 12 5 43.0 66.0 67.0 110.0 2 0.96
df_new = df.loc[df.groupby("1")["2"].idxmin()]  # lowest second number within each first-number group
print(df_new)
#     1     2     3     4      5  6     7
# 0   0  23.0  78.0  75.0  105.0  2  0.97
# 2   1  24.0  78.0  75.0  116.0  2  0.98
# 4   2  28.0  87.0  74.0  101.0  2  0.97
# 6   3  32.0  86.0  73.0   98.0  2  0.98
# 8   4  40.0  77.0  71.0   98.0  2  0.94
# 12  5  43.0  66.0  67.0  110.0  2  0.96

adding new rows to an existing dataframe

This is my dataframe. How do I add max_value, min_value, mean_value and median_value rows, so that my index values will be:
0, 1, 2, 3, 4, max_value, min_value, mean_value, median_value
Could anyone help me in solving this?
If you want to add rows, use DataFrame.append with DataFrame.agg:
df1 = df.append(df.agg(['max','min','mean','median']))
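Note that DataFrame.append was removed in pandas 2.0, so on newer versions the equivalent is pd.concat, e.g.:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(5, 4)), columns=list('ABCD'))

# pandas >= 2.0 replacement for df.append(df.agg([...])):
df1 = pd.concat([df, df.agg(['max', 'min', 'mean', 'median'])])
print(df1)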
If you want to add columns, use assign with min, max, mean and median:
df2 = df.assign(max_value=df.max(axis=1),
                min_value=df.min(axis=1),
                mean_value=df.mean(axis=1),
                median_value=df.median(axis=1))
One way is as follows (thanks to @jezrael for the help):
df = pd.DataFrame(np.random.randint(0, 100, size=(5, 4)), columns=list('ABCD'))
df1 = df.copy()

# column-wise calc
df.loc['max'] = df1.max()
df.loc['min'] = df1.min()
df.loc['mean'] = df1.mean()
df.loc['median'] = df1.median()

# row-wise calc
df['max'] = df1.max(axis=1)
df['min'] = df1.min(axis=1)
df['mean'] = df1.mean(axis=1)
df['median'] = df1.median(axis=1)
O/P:
A B C D max min mean median
0 49.0 91.0 16.0 17.0 91.0 16.0 43.25 33.0
1 20.0 42.0 86.0 60.0 86.0 20.0 52.00 51.0
2 32.0 25.0 94.0 13.0 94.0 13.0 41.00 28.5
3 40.0 1.0 66.0 31.0 66.0 1.0 34.50 35.5
4 18.0 30.0 67.0 31.0 67.0 18.0 36.50 30.5
max 49.0 91.0 94.0 60.0 NaN NaN NaN NaN
min 18.0 1.0 16.0 13.0 NaN NaN NaN NaN
mean 31.8 37.8 65.8 30.4 NaN NaN NaN NaN
median 32.0 30.0 67.0 31.0 NaN NaN NaN NaN
This worked well:
df1 = df.copy()
df.loc['max'] = df1.max()
df.loc['min'] = df1.min()
df.loc['mean'] = df1.mean()
df.loc['median'] = df1.median()

Why is pandas showing "?" instead of NaN

I'm learning pandas, and when I display the DataFrame it shows ? instead of NaN. Why is that?
Code:
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
df = pd.read_csv(url, header=None)
print(df.head())

headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
           "num-of-doors", "body-style", "drive-wheels", "engine-location",
           "wheel-base", "length", "width", "height", "curb-weight",
           "engine-type", "num-of-cylinders", "engine-size", "fuel-system",
           "bore", "stroke", "compression-ratio", "hoursepower", "peak-rpm",
           "city-mpg", "highway-mpg", "price"]
df.columns = headers
print(df.head(30))
The missing values in the data are represented by ?, so to convert them you can use the na_values parameter. Also, the names parameter of read_csv assigns the column names from the list, so the separate assignment is not necessary:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
           "num-of-doors", "body-style", "drive-wheels", "engine-location",
           "wheel-base", "length", "width", "height", "curb-weight",
           "engine-type", "num-of-cylinders", "engine-size", "fuel-system",
           "bore", "stroke", "compression-ratio", "hoursepower", "peak-rpm",
           "city-mpg", "highway-mpg", "price"]
df = pd.read_csv(url, header=None, names=headers, na_values='?')
print(df.head(10))
symboling normalized-losses make fuel-type aspiration \
0 3 NaN alfa-romero gas std
1 3 NaN alfa-romero gas std
2 1 NaN alfa-romero gas std
3 2 164.0 audi gas std
4 2 164.0 audi gas std
5 2 NaN audi gas std
6 1 158.0 audi gas std
7 1 NaN audi gas std
8 1 158.0 audi gas turbo
9 0 NaN audi gas turbo
num-of-doors body-style drive-wheels engine-location wheel-base ... \
0 two convertible rwd front 88.6 ...
1 two convertible rwd front 88.6 ...
2 two hatchback rwd front 94.5 ...
3 four sedan fwd front 99.8 ...
4 four sedan 4wd front 99.4 ...
5 two sedan fwd front 99.8 ...
6 four sedan fwd front 105.8 ...
7 four wagon fwd front 105.8 ...
8 four sedan fwd front 105.8 ...
9 two hatchback 4wd front 99.5 ...
engine-size fuel-system bore stroke compression-ratio hoursepower \
0 130 mpfi 3.47 2.68 9.0 111.0
1 130 mpfi 3.47 2.68 9.0 111.0
2 152 mpfi 2.68 3.47 9.0 154.0
3 109 mpfi 3.19 3.40 10.0 102.0
4 136 mpfi 3.19 3.40 8.0 115.0
5 136 mpfi 3.19 3.40 8.5 110.0
6 136 mpfi 3.19 3.40 8.5 110.0
7 136 mpfi 3.19 3.40 8.5 110.0
8 131 mpfi 3.13 3.40 8.3 140.0
9 131 mpfi 3.13 3.40 7.0 160.0
peak-rpm city-mpg highway-mpg price
0 5000.0 21 27 13495.0
1 5000.0 21 27 16500.0
2 5000.0 19 26 16500.0
3 5500.0 24 30 13950.0
4 5500.0 18 22 17450.0
5 5500.0 19 25 15250.0
6 5500.0 19 25 17710.0
7 5500.0 19 25 18920.0
8 5500.0 17 20 23875.0
9 5500.0 16 22 NaN
[10 rows x 26 columns]
This information is here:
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names:
Missing Attribute Values: (denoted by "?")
Another solution: if you want to replace ? with NaN after reading the data, you can do this (with numpy imported as np):
df_new = df.replace({'?': np.nan})
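One practical difference between the two approaches: with na_values='?' at read time, read_csv can infer numeric dtypes for columns such as normalized-losses, whereas replacing afterwards leaves them as object dtype until you convert them explicitly, for example:
import pandas as pd

# After df.replace, columns that contained '?' are still object dtype;
# convert the ones you need numerically:
df_new['normalized-losses'] = pd.to_numeric(df_new['normalized-losses'])
print(df_new['normalized-losses'].dtype)   # float64
print(df_new['normalized-losses'].mean())  # NaN entries are skipped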

Pandas converting timestamp and monthly summary

I have several .csv files which I am importing via pandas, and then I work out a summary of the data (min, max, mean), ideally as weekly and monthly reports. I have the following code, but I just do not seem to get the monthly summary to work; I am sure the problem is with the timestamp conversion.
What am I doing wrong?
import pandas as pd
import numpy as np
#Format of the data that is been imported
#2017-05-11 18:29:14+00:00,264.0,987.99,26.5,23.70,512.0,11.763,52.31
df = pd.read_csv('data.csv')
df['timestamp'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
print 'month info'
print [g for n, g in df.groupby(pd.Grouper(key='timestamp',freq='M'))]
print(data.groupby('timestamp')['light'].mean())
IIUC, you almost have it, and your datetime conversion is fine. Here is an example:
Starting from a dataframe like this (which is your example row, duplicated with slight modifications):
>>> df
time x y z a b c d
0 2017-05-11 18:29:14+00:00 264.0 947.99 24.5 53.7 511.0 11.463 12.31
1 2017-05-15 18:29:14+00:00 265.0 957.99 25.5 43.7 512.0 11.563 22.31
2 2017-05-21 18:29:14+00:00 266.0 967.99 26.5 33.7 513.0 11.663 32.31
3 2017-06-11 18:29:14+00:00 267.0 977.99 26.5 23.7 514.0 11.763 42.31
4 2017-06-22 18:29:14+00:00 268.0 997.99 27.5 13.7 515.0 11.800 52.31
You can do what you did before with your datetime:
df['timestamp'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
And then get your summaries either separately:
monthly_mean = df.groupby(pd.Grouper(key='timestamp',freq='M')).mean()
monthly_max = df.groupby(pd.Grouper(key='timestamp',freq='M')).max()
monthly_min = df.groupby(pd.Grouper(key='timestamp',freq='M')).min()
weekly_mean = df.groupby(pd.Grouper(key='timestamp',freq='W')).mean()
weekly_min = df.groupby(pd.Grouper(key='timestamp',freq='W')).min()
weekly_max = df.groupby(pd.Grouper(key='timestamp',freq='W')).max()
# Examples:
>>> monthly_mean
x y z a b c d
timestamp
2017-05-31 265.0 957.99 25.5 43.7 512.0 11.5630 22.31
2017-06-30 267.5 987.99 27.0 18.7 514.5 11.7815 47.31
>>> weekly_mean
x y z a b c d
timestamp
2017-05-14 264.0 947.99 24.5 53.7 511.0 11.463 12.31
2017-05-21 265.5 962.99 26.0 38.7 512.5 11.613 27.31
2017-05-28 NaN NaN NaN NaN NaN NaN NaN
2017-06-04 NaN NaN NaN NaN NaN NaN NaN
2017-06-11 267.0 977.99 26.5 23.7 514.0 11.763 42.31
2017-06-18 NaN NaN NaN NaN NaN NaN NaN
2017-06-25 268.0 997.99 27.5 13.7 515.0 11.800 52.31
Or aggregate them all together to get a multi-indexed dataframe with your summaries:
monthly_summary = df.groupby(pd.Grouper(key='timestamp',freq='M')).agg(['mean', 'min', 'max'])
weekly_summary = df.groupby(pd.Grouper(key='timestamp',freq='W')).agg(['mean', 'min', 'max'])
# Example of summary of row 'x':
>>> monthly_summary['x']
mean min max
timestamp
2017-05-31 265.0 264.0 266.0
2017-06-30 267.5 267.0 268.0
>>> weekly_summary['x']
mean min max
timestamp
2017-05-14 264.0 264.0 264.0
2017-05-21 265.5 265.0 266.0
2017-05-28 NaN NaN NaN
2017-06-04 NaN NaN NaN
2017-06-11 267.0 267.0 267.0
2017-06-18 NaN NaN NaN
2017-06-25 268.0 268.0 268.0
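Equivalently, once 'timestamp' is the index you can use resample for the same bucketing; a sketch reusing df from above (the raw 'time' string column is dropped so the aggregations stay numeric):
num = df.drop(columns=['time']).set_index('timestamp')
monthly_summary = num.resample('M').agg(['mean', 'min', 'max'])
weekly_summary = num.resample('W').agg(['mean', 'min', 'max'])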

Dropping multiple columns in pandas at once

I have a data set consisting of 135 columns. I am trying to drop the columns which have more than 60% empty data; there are approximately 40 such columns. I wrote a function to drop these empty columns, but I am getting a "Not contained in axis" error. Could someone help me solve this, or suggest another way to drop these 40 columns at once?
My function:
list_drop = df.isnull().sum() / len(df)

def empty(df):
    if list_drop > 0.5:
        df.drop(list_drop, axis=1, inplace=True)
    return df
Other method I tried:
df.drop(df.count() / len(df) < 0.5, axis=1, inplace=True)
You could use isnull + sum and then use the mask to filter df.columns.
m = df.isnull().sum(0) / len(df) < 0.6
df = df[df.columns[m]]
Demo
df
A B C
0 29.0 NaN 26.6
1 NaN NaN 23.3
2 23.0 94.0 28.1
3 35.0 168.0 43.1
4 NaN NaN 25.6
5 32.0 88.0 31.0
6 NaN NaN 35.3
7 45.0 543.0 30.5
8 NaN NaN NaN
9 NaN NaN 37.6
10 NaN NaN 38.0
11 NaN NaN 27.1
12 23.0 846.0 30.1
13 19.0 175.0 25.8
14 NaN NaN 30.0
15 47.0 230.0 45.8
16 NaN NaN 29.6
17 38.0 83.0 43.3
18 30.0 96.0 34.6
m = df.isnull().sum(0) / len(df) < 0.3 # 0.3 as an example
m
A False
B False
C True
dtype: bool
df[df.columns[m]]
C
0 26.6
1 23.3
2 28.1
3 43.1
4 25.6
5 31.0
6 35.3
7 30.5
8 NaN
9 37.6
10 38.0
11 27.1
12 30.1
13 25.8
14 30.0
15 45.8
16 29.6
17 43.3
18 34.6
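Alternatively, pandas has this built in: DataFrame.dropna with axis=1 and thresh keeps only the columns that have at least thresh non-null values, so dropping columns that are more than 60% empty is a one-liner:
# Keep columns with at least 40% non-null values, i.e. drop those
# with more than 60% missing:
df = df.dropna(axis=1, thresh=int(len(df) * 0.4))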
