How to find missing values in groups - python

I have a large dataset of restaurant inspections. One inspection will trigger several code violations. I want to find out if any inspections don't contain a specific code violation (for evidence of pests). I have the data in a Pandas data frame.
I tried separating the data frame based on whether the violation for pests was included. And I tried to group by the violation code. Can't seem to figure it out.
With the pest violation as "3A", data could look like:
import pandas as pd

df = pd.DataFrame(data={
    'visit': ['1', '1', '1', '2', '2', '3', '3'],
    'violation': ['3A', '4B', '5C', '3A', '6C', '7D', '8E']
})
  visit violation
0     1        3A
1     1        4B
2     1        5C
3     2        3A
4     2        6C
5     3        7D
6     3        8E
I'd like to end up with this:
result = pd.DataFrame(data={
    'visit': ['3', '3'],
    'violation': ['7D', '8E']
})
Out[15]:
  visit violation
0     3        7D
1     3        8E

Try using:
value = '3A'
print(df.groupby('visit').filter(lambda x: all(value != i for i in x['violation'])))
Output:
  violation visit
5        7D     3
6        8E     3

Another approach would be:
violation_visits = df[df['violation']=='3A']['visit'].unique()
df[~df['visit'].isin(violation_visits.tolist())]
Out[16]:
  visit violation
5     3        7D
6     3        8E

One way using filter (using not rather than ~ here: if Series.any() returns a plain Python bool, ~True evaluates to -2, which is truthy, so the filter would silently keep every group):
df.groupby('visit').filter(lambda x: not x['violation'].eq('3A').any())
  visit violation
5     3        7D
6     3        8E
Another way using transform
df[df.violation.ne('3A').groupby(df.visit).transform('all')]
  visit violation
5     3        7D
6     3        8E
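If all you need is the list of visit IDs that are missing the pest violation (rather than the offending rows), a minimal set-difference sketch along the same lines, using the example df from the question:

```python
import pandas as pd

df = pd.DataFrame(data={
    'visit': ['1', '1', '1', '2', '2', '3', '3'],
    'violation': ['3A', '4B', '5C', '3A', '6C', '7D', '8E']
})

# visits that did trigger the pest violation
has_pest = set(df.loc[df['violation'] == '3A', 'visit'])
# visits that never triggered it
missing = set(df['visit']) - has_pest
print(missing)  # {'3'}
```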


Formatting a 2D list using iteration - even spacing problem [duplicate]

This question already has answers here:
Printing Lists as Tabular Data
(20 answers)
Closed 1 year ago.
I have the following 2D list:
table = [['Position', 'Club', 'MP', 'GD', 'Points'],
['1', 'Man City', '38', '51', '86'],
['2', 'Man Utd', '38', '29', '74'],
['3', 'Liverpool', '38', '26', '69'],
['4', 'Chelsea', '38', '22', '67'],
['5', 'Leicester', '38', '18', '66']]
I am wanting to print it so that the format is as following:
Position  Club       MP  GD  Points
1         Man City   38  51  86
2         Man Utd    38  29  74
3         Liverpool  38  26  69
4         Chelsea    38  22  67
5         Leicester  38  18  66
My issue is with the even spacing. My attempt at solving this was using:
for i in range(len(table)):
    print(*table[i], sep=" "*(15-len(table[i])))
However, I realised the problem is that len(table[i]) gives the number of items in each row, rather than the length of each individual item, which is what even spacing would require - I think.
How can I get my desired format? And is my approach okay or is there a much better way of approaching this?
I have looked at this - 2D list output formatting - which helped with some aspects, but not, I think, with the even spacing problem.
Any help would be much appreciated, thank you!
You can use str.format for formatting the table (documentation):
table = [
    ["Position", "Club", "MP", "GD", "Points"],
    ["1", "Man City", "38", "51", "86"],
    ["2", "Man Utd", "38", "29", "74"],
    ["3", "Liverpool", "38", "26", "69"],
    ["4", "Chelsea", "38", "22", "67"],
    ["5", "Leicester", "38", "18", "66"],
]
format_string = "{:<15}" * 5
for row in table:
    print(format_string.format(*row))
Prints:
Position       Club           MP             GD             Points
1              Man City       38             51             86
2              Man Utd        38             29             74
3              Liverpool      38             26             69
4              Chelsea        38             22             67
5              Leicester      38             18             66
You should use the string formatting approach, but the problem with your current solution is that you need to consider the length of each item in the row. So you want something like:
for row in table:
    for item in row:
        print(item, end=' '*(15 - len(item)))
    print()
Note: you almost never want to use for index in range(len(some_list)); instead, iterate directly over the list. Even in the cases where you do need the index, you would almost certainly use enumerate instead of range. Here's the equivalent using ranges, but again, you wouldn't do it this way; it isn't Pythonic:
for i in range(len(table)):
    for j in range(len(table[i])):
        print(table[i][j], end=' '*(15 - len(table[i][j])))
    print()
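If the hard-coded width of 15 feels fragile, the widths can be computed from the data instead; a small sketch in the same spirit as the answers above (the two-space separator is an arbitrary choice):

```python
table = [['Position', 'Club', 'MP', 'GD', 'Points'],
         ['1', 'Man City', '38', '51', '86'],
         ['2', 'Man Utd', '38', '29', '74'],
         ['3', 'Liverpool', '38', '26', '69'],
         ['4', 'Chelsea', '38', '22', '67'],
         ['5', 'Leicester', '38', '18', '66']]

# the widest item in each column decides that column's width
widths = [max(len(row[col]) for row in table) for col in range(len(table[0]))]
for row in table:
    print("  ".join(item.ljust(width) for item, width in zip(row, widths)))
```

This keeps each column exactly as wide as it needs to be, even if a club name longer than 15 characters shows up.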

Sorting one group of columns according to another group of columns

This question is similar to this one; last time I thought it would be that simple, but it seems not (thank you @anon01 and @Ch3steR, who answered my previous question there).
So here is my new dataframe:
import pandas as pd

matrix = [(1, 3, 2, "1a", "3a", "2a"),
          (6, 5, 4, "6a", "5a", "4a"),
          (8, 7, 9, "8a", "7a", "9a"),
          ]
df = pd.DataFrame(matrix, index=list('abc'),
                  columns=["price1", "price2", "price3", "product1", "product2", "product3"])
   price1  price2  price3 product1 product2 product3
a       1       3       2       1a       3a       2a
b       6       5       4       6a       5a       4a
c       8       7       9       8a       7a       9a
I need to sort by price within each row, but price and product are pairs: if a price moves to price1, its product also needs to move to product1, because they belong together.
Here is the output I want:
   price1  price2  price3 product1 product2 product3
a       1       2       3       1a       2a       3a
b       4       5       6       4a       5a       6a
c       7       8       9       7a       8a       9a
From the last question, I tried the suggested solution using np.sort; it works for sorting the prices, but not once I have the extra product columns. I tried re-matching the products with the prices, but I think that would cost more, so I am still using my previous brute-force solution of pairwise swaps from this link:
df.loc[df['price1']>df['price2'],['price1','price2','product1','product2']] = df.loc[df['price1']>df['price2'],['price2','price1','product2','product1']].values
df.loc[df['price1']>df['price3'],['price1','price3','product1','product3']] = df.loc[df['price1']>df['price3'],['price3','price1','product3','product1']].values
df.loc[df['price2']>df['price3'],['price2','price3','product2','product3']] = df.loc[df['price2']>df['price3'],['price3','price2','product3','product2']].values
The problem is that I have more than 3 pairs.
If someone has an idea for this, it would be very helpful. Thank you!
We can make use of numpy.sort for "price", and numpy.argsort for product. This is all vectorized by numpy.
import numpy as np

# Gets all "price" columns
price = df.filter(like='price')
# Gets all "product" columns
product = df.filter(like='product')
# Sorts "price" columns row-wise and assigns the array back
# (price is a copy, so it still holds the original values afterwards)
df[price.columns] = np.sort(price.to_numpy(), axis=1)
# Builds indices for re-organizing "product" based on sorted "price"
# (row indices come from shape[0], the number of rows)
ix = np.arange(product.shape[0])[:, None]
iy = np.argsort(price.to_numpy(), axis=1)
# Re-arranges the "product" array and assigns it back
df[product.columns] = product.to_numpy()[ix, iy]
df
   price1  price2  price3 product1 product2 product3
a       1       2       3       1a       2a       3a
b       4       5       6       4a       5a       6a
c       7       8       9       7a       8a       9a
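Since NumPy 1.15 there is also np.take_along_axis, which applies the same per-row order to both blocks without building the ix helper; a sketch assuming the same df as above:

```python
import numpy as np
import pandas as pd

matrix = [(1, 3, 2, "1a", "3a", "2a"),
          (6, 5, 4, "6a", "5a", "4a"),
          (8, 7, 9, "8a", "7a", "9a")]
df = pd.DataFrame(matrix, index=list('abc'),
                  columns=["price1", "price2", "price3",
                           "product1", "product2", "product3"])

price = df.filter(like='price')      # copies, so the originals survive
product = df.filter(like='product')

order = np.argsort(price.to_numpy(), axis=1)  # per-row column order
df[price.columns] = np.take_along_axis(price.to_numpy(), order, axis=1)
df[product.columns] = np.take_along_axis(product.to_numpy(), order, axis=1)
print(df)
```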

Pivoting count of column value using python pandas

I have student data with IDs and some values, and I need to pivot the table for a count of IDs.
Here's an example of data:
     id   name  maths  science
0  B001   john     50       60
1  B021  Kenny     89       77
2  B041  Jessi    100       89
3  B121  Annie     91       73
4  B456   Mark     45       33
pivot table:
count of ID
5
There are lots of different ways to approach this; I would use either shape or nunique(), as Sandeep suggested.
import pandas as pd

data = {'id' : ['0','1','2','3','4'],
        'name' : ['john', 'kenny', 'jessi', 'Annie', 'Mark'],
        'math' : [50,89,100,91,45],
        'science' : [60,77,89,73,33]}
df = pd.DataFrame(data)
print(df)
  id   name  math  science
0  0   john    50       60
1  1  kenny    89       77
2  2  jessi   100       89
3  3  Annie    91       73
4  4   Mark    45       33
then use either of the following:
df.shape[0] (note that shape is an attribute, not a method), which gives you the number of rows of a data frame,
or:
In:  df['id'].nunique()
Out: 5
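To spell out the difference between the two counts (a sketch using the question's data): len(df) counts rows, while nunique() counts distinct IDs, so they only agree when no ID repeats:

```python
import pandas as pd

df = pd.DataFrame({'id': ['B001', 'B021', 'B041', 'B121', 'B456'],
                   'name': ['john', 'Kenny', 'Jessi', 'Annie', 'Mark'],
                   'maths': [50, 89, 100, 91, 45],
                   'science': [60, 77, 89, 73, 33]})

n_rows = len(df)            # total rows
n_ids = df['id'].nunique()  # distinct ids; smaller than len(df) if ids repeat
print(n_rows, n_ids)        # 5 5
```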

Adding data to Pandas Dataframe in for loop

I am trying to populate a pandas dataframe from multiple dictionaries. Each of the dictionaries are in the form below:
{'Miami': {'DrPepper': '5', 'Pepsi': '8'}}
{'Atlanta': {'DrPepper': '10', 'Pepsi': '25'}}
Ultimately what I want is a dataframe that looks like this (After this I plan to use pandas to do some data transformations then output the dataframe to a tab delimited file):
        DrPepper Pepsi
Miami          5     8
Atlanta       10    25
If you don't mind using an additional library, you can use toolz.merge to combine all of the dictionaries, followed by DataFrame.from_dict:
import toolz
d1 = {'Miami': {'DrPepper': '5', 'Pepsi': '8'}}
d2 = {'Atlanta': {'DrPepper': '10', 'Pepsi': '25'}}
df = pd.DataFrame.from_dict(toolz.merge(d1, d2), orient='index')
This method assumes that you don't have repeat index values (i.e city names). If you do, the repeats will be overwritten with the last one in the list of dictionaries taking precedence.
The resulting output:
        DrPepper Pepsi
Atlanta       10    25
Miami          5     8
You can use concat DataFrames created from dict by DataFrame.from_dict:
d1 = {'Miami': {'DrPepper': '5', 'Pepsi': '8'}}
d2 = {'Atlanta':{'DrPepper':'10','Pepsi':'25'}}
print (pd.DataFrame.from_dict(d1, orient='index'))
      Pepsi DrPepper
Miami     8        5
print (pd.concat([pd.DataFrame.from_dict(d1, orient='index'),
                  pd.DataFrame.from_dict(d2, orient='index')]))
        Pepsi DrPepper
Miami       8        5
Atlanta    25       10
Another solution with transpose by T:
print (pd.DataFrame(d1))
         Miami
DrPepper     5
Pepsi        8
print (pd.concat([pd.DataFrame(d1).T, pd.DataFrame(d2).T]))
        DrPepper Pepsi
Miami          5     8
Atlanta       10    25
It is also possible to use a list comprehension:
L = [d1,d2]
print (pd.concat([pd.DataFrame(d).T for d in L]))
        DrPepper Pepsi
Miami          5     8
Atlanta       10    25
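When the dictionaries really do arrive one at a time in a for loop, a common pattern is to collect the small frames in a list and concat once at the end (growing a DataFrame row by row inside the loop is much slower); a sketch with the two dicts from the question:

```python
import pandas as pd

dicts = [{'Miami': {'DrPepper': '5', 'Pepsi': '8'}},
         {'Atlanta': {'DrPepper': '10', 'Pepsi': '25'}}]

frames = []
for d in dicts:  # stands in for however the dicts are produced
    frames.append(pd.DataFrame.from_dict(d, orient='index'))
result = pd.concat(frames)
print(result)
```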

Manipulate CSV file 1 column into multiple - NFL scores

Working on an NFL CSV file that can help me automate scoring for games. Right now, I can only get the teams and scores into one column of the CSV file.
THESE ARE ALL IN COLUMN A
Example:
A
1 NYJ
2 27
3 PHI
4 20
5 BUF
6 13
7 DET
8 35
9 CIN
10 27
11 IND
12 10
13 MIA
14 24
15 NO
16 21
OR
[['NYJ'], ['27'], ['PHI'], ['20'], ['BUF'], ['13'], ['DET'], ['35'], ['CIN'], ['27'], ['IND'], ['10'], ['MIA'], ['24'], ['NO'], ['21'], ['TB'], ['12'], ['WAS'], ['30'], ['CAR'], ['25'], ['PIT'], ['10'], ['ATL'], ['16'], ['JAC'], ['20'], ['NE'], ['28'], ['NYG'], ['20'], ['MIN'], ['24'], ['TEN'], ['23'], ['STL'], ['24'], ['BAL'], ['21'], ['CHI'], ['16'], ['CLE'], ['18'], ['KC'], ['30'], ['GB'], ['8'], ['DAL'], ['6'], ['HOU'], ['24'], ['DEN'], ['24'], ['ARI'], ['32'], ['SD'], ['6'], ['SF'], ['41'], ['SEA'], ['22'], ['OAK'], ['6']]
What I want is this:
   A    B   C    D
1  NYJ  27  PHI  20
2  BUF  13  DET  35
3  CIN  27  IND  10
4  MIA  24  NO   21
I have read through previous articles on this and have not got it to work yet. Any ideas on this?
Any help is appreciated and thanks!
current script:
import nflgame
import csv

print "Purpose of this script is to get NFL Scores to help out with GUT"
pregames = nflgame.games(2013, week=[4], kind='PRE')
out = open("scores.csv", "wb")
output = csv.writer(out)
for score in pregames:
    output.writerows([[score.home], [score.score_home], [score.away], [score.score_away]])
You're currently using .writerows() to write 4 rows, each with one column.
Instead, you want:
output.writerow([score.home, score.score_home, score.away, score.score_away])
to write a single row with 4 columns.
Without knowing the score data, try changing writerows to writerow and passing one flat list, so each value gets its own column:
import nflgame
import csv

print "Purpose of this script is to get NFL Scores to help out with GUT"
pregames = nflgame.games(2013, week=[4], kind='PRE')
out = open("scores.csv", "wb")
output = csv.writer(out)
for score in pregames:
    output.writerow([score.home, score.score_home, score.away, score.score_away])
This will output each game on its own line, with four columns.
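If the one-value-per-row file already exists, it can also be reshaped after the fact by chunking the flat list four values at a time; a minimal Python 3 sketch with made-up sample values (the output filename is hypothetical):

```python
import csv

# flat list as read from the one-column file (sample values)
flat = ['NYJ', '27', 'PHI', '20', 'BUF', '13', 'DET', '35']

# regroup into rows of four: home, home score, away, away score
rows = [flat[i:i + 4] for i in range(0, len(flat), 4)]

with open("scores_wide.csv", "w", newline="") as out:
    csv.writer(out).writerows(rows)
```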
