How to delete rows in a CSV file based on blank columns - python

I have a csv file in this format (it has thousands of rows, so I'll summarize it like this):
id,name,score1,score2,score3
1,,3.0,4.5,2.0
2,,,,
3,,4.5,3.2,4.1
I have tried to use .dropna() but that is not working.
My desired output is
id,name,score1,score2,score3
1,,3.0,4.5,2.0
3,,4.5,3.2,4.1
All I really need is to check whether score1 is empty, because if score1 is empty then the rest of the scores are empty as well.
I have also tried this, but it doesn't seem to do anything.
import pandas as pd

df = pd.read_csv('dataset.csv')
df.drop(df.index[df["score1"] == ''], axis=0, inplace=True)
df.to_csv('new.csv')
Can anyone help with this?

After seeing your edits, I realized that dropna doesn't work for you because the name column is empty in every row, so dropna() (which by default drops any row containing a NaN) removes them all. To filter for NaN values in a specific column, I would recommend using the apply function as in the following code. (Btw, StackOverflow.csv is just a file where I copied and pasted the data from your question.)
import pandas as pd
import math

df = pd.read_csv("StackOverflow.csv", index_col="id")

# Function that takes a number and returns whether it is NaN
def not_nan(number):
    return not math.isnan(number)

# Filter the dataframe with the function
df = df[df["score1"].apply(not_nan)]
What this does is iterate through the score1 column and check whether each value is NaN. If it is, the function returns False. We then use the resulting Series of True and False values to filter the rows of the dataframe.
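The same filter can also be written without a helper function, using pandas' built-in notna mask (a minimal sketch, assuming the same StackOverflow.csv file):

import pandas as pd

df = pd.read_csv("StackOverflow.csv", index_col="id")
# keep only the rows where score1 is present
df = df[df["score1"].notna()]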

import pandas as pd

# rows of unequal length are padded with NaN when the frame is built
df = pd.DataFrame([[1, 3.0, 4.5, 2.0], [2], [3, 4.5, 3.2, 4.1]],
                  columns=["id", "score1", "score2", "score3"])
aux1 = df.dropna()                # default: drop rows containing any NaN
aux2 = df.dropna(axis='columns')  # drop columns containing any NaN
aux3 = df.dropna(axis='rows')     # explicit row axis, same as the default
print('=== original ===')
print(df)
print()
print('=== mode 1 ===')
print(aux1)
print()
print('=== mode 2 ===')
print(aux2)
print()
print('=== mode 3 ===')
print(aux3)
print()
print('=== mode 4 ===')
print('drop original')
df.dropna(axis=1, inplace=True)   # drop NaN columns in place
print(df)
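Applied to the question's own CSV, the row-dropping mode needs a subset: the all-empty name column would otherwise make dropna() discard every row. A minimal sketch, assuming the question's dataset.csv:

import pandas as pd

df = pd.read_csv('dataset.csv')
# score1 missing implies the other scores are missing too
df = df.dropna(subset=['score1'])
df.to_csv('new.csv', index=False)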

Related

Pandas Conditional formatting by comparing the column values of dataframe

import io
import pandas as pd

csv_data = '''App_name,pre-prod,prod,stage
matching-image,nginx,nginx,nginx
mismatching-image,nginx,nginx,nginx:1.23.3-alpine'''
df = pd.read_csv(io.StringIO(csv_data), sep=",")
html_table = df.to_html()
Is there a way to compare the values of columns in a dataframe and use the result in conditional formatting? I want to compare the 'pre-prod', 'prod' and 'stage' values; if they mismatch, the row's background colour should be red. I have tried the following methods in pandas, but none of them works.
df.style.apply()
df.style.apply_index()
df.style.applymap()
Current output and desired output: (screenshots omitted)
You can add style conditionally by applying a style to a subset of your dataframe, like this:
import io
import pandas as pd
csv_data = '''App_name,pre-prod,prod,stage
matching-image,nginx,nginx,nginx
mismatching-image,nginx,nginx,nginx:1.23.3-alpine'''
def add_color(row):
    return ['background-color: red'] * len(row)

df = pd.read_csv(io.StringIO(csv_data), sep=",")
df.loc[(df["pre-prod"] == df["prod"]) & (df["prod"] == df["stage"])].style.apply(add_color, axis=1)
import io
import pandas as pd
csv_data = '''
App_name,pre-prod,prod,stage
matching-image,nginx,nginx,nginx
matching-image,nginx,nginx,nginx
mismatching-image,nginx,nginx,nginx:1.23.3-alpine
mismatching-image,nginx,nginx,nginx:1.23.3-alpine
'''
df = pd.read_csv(io.StringIO(csv_data), sep=",")
def match_checker(row):
    if row['prod'] == row['pre-prod'] == row['stage']:
        return [''] * len(row)
    else:
        return ['background-color: red'] * len(row)

df = df.style.apply(match_checker, axis=1)
html_table = df.to_html()
# the with statement closes the file automatically
with open('testpandas.html', 'w+') as html_file:
    html_file.write(html_table)
Updated #PeterSmith answer.
It's also possible to style the entire DataFrame in one go by passing axis=None to apply.
We can identify rows which have differing values in the specified columns by comparing the first column (column 0) with the remaining columns (columns 1-2), finding the unequal values using ne on axis=0.
df[['prod', 'stage']].ne(df['pre-prod'], axis=0)
# prod stage
# 0 False False
# 1 False True
Then we can check across rows for any rows which have any True values (meaning there is something that's not equal in the row).
df[['prod', 'stage']].ne(df['pre-prod'], axis=0).any(axis=1)
# 0 False
# 1 True
# dtype: bool
We can then simply apply the styles anywhere there's a True value in the resulting Series.
Altogether this could look something like:
def colour_rows_that_dont_match(df_: pd.DataFrame, comparison_cols: List[str]):
    # Sanity check that comparison_cols is what we expect
    assert isinstance(comparison_cols, list) and len(comparison_cols) > 1, \
        'Must be a list and provide at least 2 columns to compare'
    # Create an empty DataFrame of styles with the same shape as the original df
    styles_df = pd.DataFrame('', index=df_.index, columns=df_.columns)
    # Compare the first column's values to the remaining columns.
    # Find rows where any values are not equal (ne)
    rows_that_dont_match = df_[comparison_cols[1:]].ne(df_[comparison_cols[0]], axis=0).any(axis=1)
    # Apply styles to rows which meet the above criteria
    styles_df.loc[rows_that_dont_match, :] = 'background-color: red'
    return styles_df
df.style.apply(
    colour_rows_that_dont_match,
    # This gets passed to the function
    comparison_cols=['pre-prod', 'prod', 'stage'],
    # Apply to the entire DataFrame at once
    axis=None
).to_html(buf='test_df.html')
Which produces a table with the mismatching rows highlighted in red (output screenshot omitted).
Setup, version, and imports:
from typing import List
import pandas as pd # version 1.5.2
df = pd.DataFrame({
    'App_name': ['matching-image', 'mismatching-image'],
    'pre-prod': ['nginx', 'nginx'],
    'prod': ['nginx', 'nginx'],
    'stage': ['nginx', 'nginx:1.23.3-alpine']
})

How to drop duplicates ignoring one column

I have a DataFrame with multiple columns; the last column is a timestamp which I want Python to ignore. I've used drop_duplicates(subset=...) but it does not work, as it returns literally the same DataFrame.
This is what the DataFrame looks like:

   id     name  features            timestamp
1  34233  Bob   athletics           04-06-2022
2  23423  John  mathematics         03-06-2022
3  34233  Bob   english_literature  06-06-2022
4  23423  John  mathematics         10-06-2022
...
And these are the data types from df.dtypes:

id           int64
name         object
features     object
timestamp    object
Lastly, this is the piece of code I used:
df.drop_duplicates(subset=df.columns.tolist().remove("timestamp"), keep="first").reset_index(drop=True)
The idea is to keep track of changes based on a timestamp IF there are changes to the other columns. For instance, I don't want to keep row 4 because nothing has changed with John, however, I want to keep Bob as it has changed from athletics to english_literature. Does that make sense?
EDIT:
This is the full code:
"""
db_data contains 10 records
new_data contains 12 records but I know only 5 are needed based on the logic I want to implement
"""
db_data = pd.read_sql("SELECT * FROM subscribed", engine)
new_data = pd.read_csv("new_data.csv")
# Checking columns match
# This prints "matching"
if db_data.columns.equals(new_data.columns): print("matching")
df = pd.concat([db_data, new_data], axis=1)
consider = [x for x in df.columns if x != "timestamp"]
df = df.drop_duplicates(subset=consider).reset_index(drop=True)
# This outputs 22 but should have printed 15
print(len(df))
TEST:
I've done a test, but it has puzzled me even more. I've created a separate table in the db, loaded the csv file new_data.csv into it, and then used read_sql to get it back into a DataFrame. Surprisingly, this works. However, I do not want to take this unnecessary extra step, and I am puzzled about why it works. I've checked the data types and they match.
db_data = pd.read_sql("SELECT * FROM subscribed", engine)
new_data = pd.read_sql("SELECT * FROM test", engine)
# Checking columns match
# This still prints "matching"
if db_data.columns.equals(new_data.columns): print("matching")
df = pd.concat([db_data, new_data], axis=1)
consider = [x for x in df.columns if x != "timestamp"]
df = df.drop_duplicates(subset=consider).reset_index(drop=True)
# This gives the right output... in other words, it worked.
print(len(df))
The remove method of a list returns None. That's why the returned dataframe is unchanged: drop_duplicates received subset=None and therefore considered every column, including timestamp. You can do as follows (combined in the snippet after these steps):
Create the list of columns for the subset: col_subset = df.columns.tolist()
Remove timestamp: col_subset.remove('timestamp')
Use the col_subset list in the drop_duplicates() function: df.drop_duplicates(subset=col_subset, keep="first").reset_index(drop=True)
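Put together, the three steps look like this:

col_subset = df.columns.tolist()
col_subset.remove('timestamp')
df = df.drop_duplicates(subset=col_subset, keep="first").reset_index(drop=True)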
Try this:
consider = [x for x in df.columns if x != "timestamp"]
df.drop_duplicates(subset=consider).reset_index(drop=True)
(You don't need tolist() and keep="first" here)
If I understood you correctly, this code would do:
df.drop_duplicates(subset='features', keep='first').reset_index()

Compare entire rows for equality if some condition is satisfied

Let's say I have the following data of a match in a CSV file:
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
I'm writing a Python program. Somewhere in my program I have scores collected for a match stored in a list, say x = [1,0,4]. I have found where in the data these scores exist using pandas, and I can print "found" or "not found". However, I want my code to print out which name these scores correspond to. In this case the program should output "Charlie", since Charlie has all these values [1,0,4]. How can I do that?
I will have a large set of data so I must be able to tell which name corresponds to the numbers I pass to the program.
Yes, here's how to compare entire rows in a dataframe:
df[(df == x).all(axis=1)].index # where x is the pd.Series we're comparing to
Also, it makes life easiest if you directly set name as the index column when you read in the CSV.
import pandas as pd
from io import StringIO

csv_data = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(csv_data), index_col='name')
x = pd.Series({'match1': 1, 'match2': 0, 'match3': 4})
Now you can see that doing df == x, or equivalently df.eq(x), is not quite what you want, because it compares element-wise and returns a row of True/False values per row. So you need to aggregate those rows with .all(axis=1), which finds rows where all comparison results were True...
df.eq(x).all(axis=1)
df[ (df == x).all(axis=1) ]
# match1 match2 match3
# name
# Charlie 1 0 4
...and then finally since you only want the name of such rows:
df[ (df == x).all(axis=1) ].index
# Index(['Charlie'], dtype='object', name='name')
df[ (df == x).all(axis=1) ].index.tolist()
# ['Charlie']
which is what you wanted. (I only added the spaces inside the expression for clarity).
You need to use DataFrame.loc which would work like this:
print(df.loc[(df.match1 == 1) & (df.match2 == 0) & (df.match3 == 4), 'name'])
Maybe try something like this:
import pandas as pd
import numpy as np

# Make sample data
match1 = np.array([2, 2, 1])
match2 = np.array([4, 4, 0])
match3 = np.array([3, 3, 4])
name = np.array(['Alice', 'Bob', 'Charlie'])
df = pd.DataFrame({'name': name, 'match1': match1, 'match2': match2, 'match3': match3})
df

# example of the list you want to get the data from
x = [1, 0, 4]
# x = [2, 4, 3]

# should return the name Charlie as well as the index (based on the values in the list x)
df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])]

# Make a new dataframe out of the above
mydf = pd.DataFrame(df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])])

# Loop that prints the names based on the index of mydf
# (if several names match it prints all of them; if only one, just that one)
for i in range(len(mydf)):
    print(mydf['name'].iloc[i])
You can use this. Here data is your DataFrame (change the name to match your own). Assuming the match columns hold int values:
data = data[(data['match1'] == 1) & (data['match2'] == 0) & (data['match3'] == 4)].index
print(data[0])
If the columns are object (string) type, use this instead:
data = data[(data['match1'] == "1") & (data['match2'] == "0") & (data['match3'] == "4")].index
print(data[0])

Pandas Correction Previous Row

I have a dataframe like this.
import pandas as pd

# create dataframe
df = pd.DataFrame({"Date": range(0, 22),
                   "Country": ["USA"] * 22,
                   "Number": [0,0,0,0,0,1,1,3,5,6,4,6,7,8,7,10,25,50,75,60,45,100],
                   "Number is Corrected": [0,0,0,0,0,1,1,3,5,6,6,6,7,7,7,10,25,50,50,60,60,100]})
But this dataframe has a problem: some numbers are wrong.
Each number always has to be at least as large as the previous one, but some values break this (6,4,6,7,8,7 ... 50,75,60,45,100).
I don't use df.sort because it's not about sorting, it's about correction.
Edit: I added corrected numbers in "number is corrected" column.
Guessing from your 'Number is Corrected' list, you could probably use this:
import pandas as pd

# create dataframe
df = pd.DataFrame({"Date": range(0, 22),
                   "Country": ["USA"] * 22,
                   "Number": [0,0,0,0,0,1,1,3,5,6,4,6,7,8,7,10,25,50,75,60,45,100]})
# "Number is Corrected": [0,0,0,0,0,1,1,3,5,6,6,6,7,7,7,10,25,50,50,60,60,100]

def correction():
    df['Number is Corrected'] = df['Number']
    cache = 0
    for num in range(len(df)):
        # replace any value smaller than the running maximum
        if df.loc[num, 'Number is Corrected'] < cache:
            df.loc[num, 'Number is Corrected'] = cache
        else:
            cache = df.loc[num, 'Number is Corrected']
    print(df)

if __name__ == "__main__":
    correction()
but there is some inconsistency, as in your conversation with jezrael. You'll need to update the logic of the code once it's clearer what output you want. Good luck.
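For what it's worth, the loop above computes a running maximum, which pandas can do in one vectorized step. A minimal sketch (it reproduces the loop's behaviour, not the hand-edited 'Number is Corrected' column, which has the inconsistencies mentioned):

# running maximum: each value becomes the largest value seen so far
df['Number is Corrected'] = df['Number'].cummax()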

JSON File getting output as a dictionary for every row and need to create a DataFrame from it

I have a .json file, and I convert it into a DataFrame with:
df = pd.read_json('tummy.json')
The output looks like -
results
0 {u'objectId': u'06Dig7sXhU', u'SpecialProperti...'
1 {u'objectId': u'07VO1j4gVC', u'SpecialProperti...'
Every row seems to be a dictionary itself. I want to extract every row and create a DataFrame out of it. I would really appreciate some help on how to proceed.
IIUC you can use:
import pandas as pd

s = pd.Series(({u'objectId': u'06Dig7sXhU', u'SpecialProperties': u'456456'},
               {u'objectId': u'07VO1j4gVC', u'SpecialProperties': u'878421'}))
df = pd.DataFrame({'results': s})
print(df)
#                                              results
# 0  {u'objectId': u'06Dig7sXhU', u'SpecialProperti...
# 1  {u'objectId': u'07VO1j4gVC', u'SpecialProperti...

print(pd.DataFrame([x for x in df['results']], index=df.index))
#   SpecialProperties    objectId
# 0            456456  06Dig7sXhU
# 1            878421  07VO1j4gVC
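In more recent pandas versions the same expansion can also be done with pd.json_normalize (a sketch, under the assumption that the dicts in results are flat):

import pandas as pd

# expand the column of dicts into one column per key
expanded = pd.json_normalize(df['results'].tolist())
expanded.index = df.index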
