Pandas check value in column - python

I'm not used to pandas's syntax; I'm just trying to check whether a value is in a column. Here, data is my dataframe and ini is the column name. I've tested this:
data = pdsql.read_sql_query("select id_bdcarth, id_nd_ini::int ini, id_nd_fin::int fin, v from tempturbi.tmp_somme_v19",connection)
exists = 973237173 in data.ini
print(exists)
and I get False as the result (but the value is in the column). Is my method wrong?
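For anyone hitting the same thing: the in operator on a pandas Series tests membership in the index, not in the values, so 973237173 in data.ini looks for an index label. A minimal sketch of value-based checks, using the same data and ini names as above:

# 'in' on a Series checks index labels, so test the values instead
exists = 973237173 in data.ini.values
# or, equivalently, with isin:
exists = data['ini'].isin([973237173]).any()
print(exists)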

Related

Use headers from one dataframe to find the row indexes from a second

I'm trying to use a reference table to take action based on the headers of a file I am importing into a dataframe.
print(df_reference_table)
file_type  source  col_name1      col_name2       col_name3  ...
Status     G081    TAIL           MDS             BASE       ...
LIMS-EV            Serial Number  Mission Design  Location   ...
IMDS               ACFT           Designator      CMD        ...
print(df_import_table.columns.values)
['TAIL' 'MDS' 'BASE']
cols_in = df_import_table.columns.values
I'm looking for something that will return ['Status', 'G081']; the script would then add/delete/rename columns as needed so they match. My source documents have different numbers of columns, and I have no control over the format/names/length before they get to me.
I've tried the following:
In:
t = df_import_table.columns.values
df_reference_table.loc[t]
Out:
KeyError:['TAIL' 'MDS' 'BASE'] not in index
In:
l = list(df_import_table.columns.values)
df_reference_table.loc[l]
Out:
KeyError:['TAIL' 'MDS' 'BASE'] not in index
In:
t = df_import_table.columns.values
df_reference_table.index[df_reference_table.columns == t].tolist()
Out:
ValueError: Lengths must match to compare
Basically, I want to do the reverse of:
df_format.loc['Status','G081'].tolist()
Use a boolean mask:
# Set 'file_type' and 'source' as index if it's not already the case
df_reference_table = df_reference_table.set_index(['file_type', 'source'])
cols = df_import_table.columns.tolist()
mask = df_reference_table.eq(cols).any(axis=1)
print(df_reference_table[mask].index.to_flat_index()[0])
# Output:
('Status', 'G081')
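A possible follow-up, once you have the (file_type, source) tuple, is to pull the matching reference row back out to drive the add/delete/rename step:

file_type, source = df_reference_table[mask].index.to_flat_index()[0]
# full reference row for that file type, to compare against the import's headers
reference_row = df_reference_table.loc[(file_type, source)]
print(reference_row.tolist())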

pandas: while loop to simultaneously advance through multiple lists and call functions

I want my code to:
read data from a CSV and make a dataframe: "source_df"
see if the dataframe contains any columns specified in a list: "possible_columns"
call a unique function to replace the values in each column whose header is found in the "possible_columns" list, then insert the modified values into a new dataframe: "destination_df"
Here it is:
import pandas as pd
#creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)
#creates destination_df
blanklist = []
destination_df = pd.DataFrame(blanklist)
#create the column header lists for comparison in the while loop
columns = source_df.head(0)
possible_columns = ['yes/no','true/false']
#establish the functions list and define the functions to replace column values
fix_functions_list = ['yes_no_fix()','true_false_fix()']
def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No","0").replace("Yes","1")
def true_false_fix():
    destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')
'''use the counter to call a unique function from the function list to replace the values in each column whose header is found in the "possible_columns" list, insert the modified values in "destination_df", then advance the counter'''
counter = 0
while counter < len(possible_columns):
    if possible_columns[counter] in columns:
        destination_df.insert(counter, possible_columns[counter], source_df[possible_columns[counter]])
        fix_functions_list[counter]
    counter = counter + 1
#see if it works
print(destination_df.head(10))
When I print(destination_df), I see the unmodified column values from source_df. When I call the functions independently they work, which makes me think something is going wrong in my while loop.
Your issue is that you are trying to call a function that is stored in a list as a string.
fix_functions_list[counter]
This will not actually run the function; it just accesses the string value.
Store the function objects themselves instead, so they can be called:
def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No","0").replace("Yes","1")
def true_false_fix():
    destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')
fix_functions_list = {0: yes_no_fix, 1: true_false_fix}
and change the function call to:
fix_functions_list[counter]()
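Putting it together, a minimal runnable sketch of the dict-dispatch pattern (the CSV name and column headers are taken from the question; I've assumed the fix functions should read the column just copied into destination_df rather than a 'yes/no fixed' column):

import pandas as pd

source_df = pd.read_csv("yes-no-true-false.csv")
destination_df = pd.DataFrame()
possible_columns = ['yes/no', 'true/false']

def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no'].replace("No", "0").replace("Yes", "1")

def true_false_fix():
    destination_df['true/false'] = destination_df['true/false'].replace('False', '1').replace('True', '0')

# map positions to function objects, not strings
fix_functions_list = {0: yes_no_fix, 1: true_false_fix}

for counter, column in enumerate(possible_columns):
    if column in source_df.columns:
        destination_df[column] = source_df[column]
        fix_functions_list[counter]()  # the trailing () actually calls the function

print(destination_df.head(10))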
Alternatively, you can skip the functions entirely and use a mapping dict:
#creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)
possible_columns = ['yes/no','true/false']
mapping_dict = {'yes/no': {"No": "0", "Yes": "1"}, 'true/false': {'False': '1', 'True': '0'}}
old_columns = [column for column in source_df.columns if column not in possible_columns]
existed_columns = [column for column in source_df.columns if column in possible_columns]
new_df = source_df[existed_columns].copy()  # copy to avoid modifying a view of source_df
for column in new_df.columns:
    new_df[column] = new_df[column].map(mapping_dict[column])  # assign the mapped result back
new_df[old_columns] = source_df[old_columns]

Any differences between iterating over values in columns of dataframe and assigning variable to data in column?

I ran the following code, but Spyder returned "float division by zero":
import pandas as pd
file = pd.read_csv(r"data_ET.csv")
def normalise(df, columnName):
    for value in df[columnName]:
        df[columnName] = (value - df[columnName].min())/(df[columnName].max()-df[columnName].min())  # this line raises the error
    return df[columnName]
#b)
normalised_RTfirstpass = normalise(file, 'RTfirstpass')
normalised_logfreq = normalise(file, 'log_freq')
file['normalised RTfirstpass'] = normalised_RTfirstpass
file['normalised logfreq'] = normalised_logfreq
print(file)
When I changed it to the following, it works (the change here is assigning the column values to a variable instead of looping):
import pandas as pd
file = pd.read_csv(r"data_ET.csv")
def normalise(df, columnName):
    value = df[columnName]
    df[columnName] = (value - df[columnName].min())/(df[columnName].max()-df[columnName].min())
    return df[columnName]
#b)
normalised_RTfirstpass = normalise(file, 'RTfirstpass')
normalised_logfreq = normalise(file, 'log_freq')
file['normalised RTfirstpass'] = normalised_RTfirstpass
file['normalised logfreq'] = normalised_logfreq
print(file)
Can anybody explain why the latter works but the former does not?
In the first version, value is a single scalar taken from the column, so (value - df[columnName].min())/(df[columnName].max()-df[columnName].min()) is a scalar, and assigning it to df[columnName] overwrites the entire column with that one number. On the next pass through the loop, df[columnName].max() equals df[columnName].min(), so the denominator is zero, which is the "float division by zero".
In the latter case, value = df[columnName] is the whole pd.Series. pandas broadcasts the scalar df[columnName].min() and df[columnName].max() across the Series and performs the arithmetic element-wise in one vectorised operation; that is why you do not need to iterate over every value in the column.
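A tiny self-contained demonstration of that broadcasting (hypothetical numbers, not the data_ET.csv data):

import pandas as pd

s = pd.Series([2.0, 4.0, 6.0])
# s.min() and s.max() are scalars; pandas broadcasts them across the Series,
# normalising the whole column in one vectorised expression
normalised = (s - s.min()) / (s.max() - s.min())
print(normalised.tolist())  # [0.0, 0.5, 1.0]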

How to iterate over a CSV file with Pywikibot

I wanted to try uploading a series of items to test.wikidata, creating the item and then adding a statement of inception P571. The csv file sometimes has a date value, sometimes not. When no date value is given, I want to write out a placeholder 'some value'.
Imagine a dataframe like this (None marking the missing date):
df = {'Object': [1, 2, 3], 'Date': [250, None, 300]}
However, I am not sure how to iterate over a csv file with Pywikibot to create an item for each row and add a statement. Here is the code I wrote:
import pywikibot
import pandas as pd
site = pywikibot.Site("test", "wikidata")
repo = site.data_repository()
df = pd.read_csv('experiment.csv')
item = pywikibot.ItemPage(repo)
for item in df:
    date = df['date']
    prop_date = pywikibot.Claim(repo, u'P571')
    if date == '':
        prop_date.setSnakType('somevalue')
    else:
        target = pywikibot.WbTime(year=date)
        prop_date.setTarget(target)
    item.addClaim(prop_date)
When I run this through PAWS, I get the message: KeyError: 'date'
But I think the real issue here is that I am not sure how to get Pywikibot to iterate over each row of the dataframe and create a new claim for each new date value. I would value any feedback or suggestions for good examples and documentation. Many thanks!
Looking back on this, the solution was to use .iterrows() or .itertuples() or .loc[] to access the values in the row.
So
for row in df.itertuples():
    item = pywikibot.ItemPage(repo)  # create a fresh item for each row
    prop_date = pywikibot.Claim(repo, u'P571')
    if pd.isna(row.Date):  # empty CSV cells are read as NaN, not ''
        prop_date.setSnakType('somevalue')
    else:
        target = pywikibot.WbTime(year=int(row.Date))
        prop_date.setTarget(target)
    item.addClaim(prop_date)
Note that itertuples() exposes each column under its exact name, so with a 'Date' column the attribute is row.Date; the case mismatch in df['date'] is also what raised KeyError: 'date' earlier.
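For reference, the iteration pattern on its own (pandas only, no Wikidata calls, using the example dataframe from the question):

import pandas as pd

df = pd.DataFrame({'Object': [1, 2, 3], 'Date': [250, None, 300]})
for row in df.itertuples():
    if pd.isna(row.Date):  # missing date -> placeholder
        print(row.Object, '-> some value')
    else:
        print(row.Object, '->', int(row.Date))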

How to build a dataframe from scratch while filling in missing data? (details included in question)

I have a dataframe, named relevantdata in the code, with one row per Date/Key/Confirmed combination (shown as an image in the original post).
I want the dataframe transformed to a format with one row per Key and one column per date (also shown as an image).
Essentially, I want to get the relevant confirmed number for each Key for all the dates available in the dataframe. If a particular date is not available for a Key, that value should be zero.
Currently my code is as follows. (A try/except block is used because some Keys don't have the whole range of dates; a KeyError occurs the first time you refer to a missing date using countrydata.at[date,'Confirmed'] for the respective Key, and the except block then enters a zero into the dictionary for that date.)
import pandas

relevantdata = pandas.read_csv('https://raw.githubusercontent.com/open-covid-19/data/master/output/data_minimal.csv')
dates = relevantdata['Date'].unique().tolist()
covidcountries = relevantdata['Key'].unique().tolist()
data = dict()
data['Country'] = covidcountries
confirmeddata = relevantdata[['Date','Key','Confirmed']]
for country in covidcountries:
    for date in dates:
        countrydata = confirmeddata.loc[lambda confirmeddata: confirmeddata['Key'] == country].set_index('Date')
        try:
            if (date in data.keys()) == False:
                data[date] = list()
                data[date].append(countrydata.at[date,'Confirmed'])
            else:
                data[date].append(countrydata.at[date,'Confirmed'])
        except:
            if (date in data.keys()) == False:
                data[date].append(0)
            else:
                data[date].append(0)
finaldf = pandas.DataFrame(data = data)
While the above code accomplishes what I want, getting the dataframe into the format I require, it is way too slow, having to loop through every Key and date. I want to know if there is a better and faster method of doing the same without a nested for loop. Thank you for all your help.
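This long-to-wide reshape with zero-filling is what pivot_table does in a single vectorised call; a sketch against the same CSV, assuming one row per (Key, Date) pair:

import pandas

relevantdata = pandas.read_csv('https://raw.githubusercontent.com/open-covid-19/data/master/output/data_minimal.csv')
# one row per Key, one column per Date; missing (Key, Date) pairs become 0
finaldf = (relevantdata
           .pivot_table(index='Key', columns='Date', values='Confirmed', fill_value=0)
           .reset_index()
           .rename(columns={'Key': 'Country'}))
print(finaldf.head())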
