Pandas check value in column - python

I'm not used to pandas's syntax; I'm just trying to check whether a value is in a column. Here, data is my dataframe and ini is the column name. I've tested this:
data = pdsql.read_sql_query("select id_bdcarth, id_nd_ini::int ini, id_nd_fin::int fin, v from tempturbi.tmp_somme_v19",connection)
exists = 973237173 in data.ini
print(exists)
and I get False as the result (but the value is in the column). Is my method wrong?
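For anyone hitting the same thing: the in operator on a pandas Series tests membership in the index, not in the values, so 973237173 in data.ini looks for an index label. A minimal sketch of value-based checks, using the same data and ini names as above:

# 'in' on a Series checks index labels, so test the values instead
exists = 973237173 in data.ini.values
# or, equivalently, with isin:
exists = data['ini'].isin([973237173]).any()
print(exists)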

Related

Use headers from one dataframe to find the row indexes from a second

I'm trying to use a reference table to take action based on the headers of a file I am importing into a dataframe.
print(df_reference_table)
file_type  source  col_name1      col_name2       col_name3  ...
Status     G081    TAIL           MDS             BASE       ...
LIMS-EV            Serial Number  Mission Design  Location   ...
IMDS               ACFT           Designator      CMD        ...
print(df_import_table.columns.values)
['TAIL' 'MDS' 'BASE']
cols_in = df_import_table.columns.values
I'm looking for something that will return ['Status', 'G081']; the script would then add/delete/rename columns as needed so they match. My source documents have different numbers of columns, and I have no control over the format/names/length before they get to me.
I've tried the following:
In:
t = df_import_table.columns.values
df_reference_table.loc[t]
Out:
KeyError:['TAIL' 'MDS' 'BASE'] not in index
In:
l = list(df_import_table.columns.values)
df_reference_table.loc[l]
Out:
KeyError:['TAIL' 'MDS' 'BASE'] not in index
In:
t = df_import_table.columns.values
df_reference_table.index[df_reference_table.columns == t].tolist()
Out:
ValueError: Lengths must match to compare
Basically, I want to do the reverse of:
df_format.loc['Status','G081'].tolist()
Use a boolean mask:
# Set 'file_type' and 'source' as index if it's not already the case
df_reference_table = df_reference_table.set_index(['file_type', 'source'])
cols = df_import_table.columns.tolist()
mask = df_reference_table.eq(cols).any(axis=1)
print(df_reference_table[mask].index.to_flat_index()[0])
# Output:
('Status', 'G081')
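A possible follow-up, once you have the (file_type, source) tuple, is to pull the matching reference row back out to drive the add/delete/rename step:

file_type, source = df_reference_table[mask].index.to_flat_index()[0]
# full reference row for that file type, to compare against the import's headers
reference_row = df_reference_table.loc[(file_type, source)]
print(reference_row.tolist())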

pandas: while loop to simultaneously advance through multiple lists and call functions

I want my code to:
read data from a CSV and make a dataframe: "source_df"
see if the dataframe contains any columns specified in a list: "possible_columns"
call a unique function to replace the values in each column whose header is found in the "possible_columns" list, then insert the modified values into a new dataframe: "destination_df"
Here it is:
import pandas as pd
#creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)
#creates destination_df
blanklist = []
destination_df = pd.DataFrame(blanklist)
#create the column header lists for comparison in the while loop
columns = source_df.head(0)
possible_columns = ['yes/no','true/false']
#establish the functions list and define the functions to replace column values
fix_functions_list = ['yes_no_fix()','true_false_fix()']
def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No","0").replace("Yes","1")
def true_false_fix():
    destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')
'''use the counter to call a unique function from the function list to replace the values in each column whose header is found in the "possible_columns" list, insert the modified values in "destination_df", then advance the counter'''
counter = 0
while counter < len(possible_columns):
    if possible_columns[counter] in columns:
        destination_df.insert(counter, possible_columns[counter], source_df[possible_columns[counter]])
        fix_functions_list[counter]
    counter = counter + 1
#see if it works
print(destination_df.head(10))
When I print(destination_df), I see the unmodified column values from source_df. When I call the functions independently they work, which makes me think something is going wrong in my while loop.
Your issue is that you are trying to call a function that is stored in a list as a string.
fix_functions_list[counter]
This will not actually run the function; it just accesses the string value.
Store the function objects themselves instead, so they can be called:
def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No","0").replace("Yes","1")
def true_false_fix():
    destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')
fix_functions_list = {0: yes_no_fix, 1: true_false_fix}
and change the function call to:
fix_functions_list[counter]()
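Putting it together, a minimal runnable sketch of the dict-dispatch pattern (the CSV name and column headers are taken from the question; I've assumed the fix functions should read the column just copied into destination_df rather than a 'yes/no fixed' column):

import pandas as pd

source_df = pd.read_csv("yes-no-true-false.csv")
destination_df = pd.DataFrame()
possible_columns = ['yes/no', 'true/false']

def yes_no_fix():
    destination_df['yes/no'] = destination_df['yes/no'].replace("No", "0").replace("Yes", "1")

def true_false_fix():
    destination_df['true/false'] = destination_df['true/false'].replace('False', '1').replace('True', '0')

# map positions to function objects, not strings
fix_functions_list = {0: yes_no_fix, 1: true_false_fix}

for counter, column in enumerate(possible_columns):
    if column in source_df.columns:
        destination_df[column] = source_df[column]
        fix_functions_list[counter]()  # the trailing () actually calls the function

print(destination_df.head(10))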
Alternatively, you can skip the functions entirely and use a mapping dict:
#creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)
possible_columns = ['yes/no','true/false']
mapping_dict = {'yes/no': {"No": "0", "Yes": "1"}, 'true/false': {'False': '1', 'True': '0'}}
old_columns = [column for column in source_df.columns if column not in possible_columns]
existed_columns = [column for column in source_df.columns if column in possible_columns]
new_df = source_df[existed_columns].copy()  # copy to avoid modifying a view of source_df
for column in new_df.columns:
    new_df[column] = new_df[column].map(mapping_dict[column])  # assign the mapped result back
new_df[old_columns] = source_df[old_columns]

Any differences between iterating over values in columns of dataframe and assigning variable to data in column?

I ran the following code, but Spyder returned "float division by zero":
import pandas as pd
file = pd.read_csv(r"data_ET.csv")
def normalise(df, columnName):
    for value in df[columnName]:
        df[columnName] = (value - df[columnName].min())/(df[columnName].max()-df[columnName].min())  # this line raises the error
    return df[columnName]
#b)
normalised_RTfirstpass = normalise(file, 'RTfirstpass')
normalised_logfreq = normalise(file, 'log_freq')
file['normalised RTfirstpass'] = normalised_RTfirstpass
file['normalised logfreq'] = normalised_logfreq
print(file)
When I changed it to the following, it works (the change here is assigning the column values to a variable instead of looping):
import pandas as pd
file = pd.read_csv(r"data_ET.csv")
def normalise(df, columnName):
    value = df[columnName]
    df[columnName] = (value - df[columnName].min())/(df[columnName].max()-df[columnName].min())
    return df[columnName]
#b)
normalised_RTfirstpass = normalise(file, 'RTfirstpass')
normalised_logfreq = normalise(file, 'log_freq')
file['normalised RTfirstpass'] = normalised_RTfirstpass
file['normalised logfreq'] = normalised_logfreq
print(file)
Can anybody explain why the latter works but the former does not?
In the first version, value is a single scalar taken from the column, so (value - df[columnName].min())/(df[columnName].max()-df[columnName].min()) is a scalar, and assigning it to df[columnName] overwrites the entire column with that one number. On the next pass through the loop, df[columnName].max() equals df[columnName].min(), so the denominator is zero, which is the "float division by zero".
In the latter case, value = df[columnName] is the whole pd.Series. pandas broadcasts the scalar df[columnName].min() and df[columnName].max() across the Series and performs the arithmetic element-wise in one vectorised operation; that is why you do not need to iterate over every value in the column.
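A tiny self-contained demonstration of that broadcasting (hypothetical numbers, not the data_ET.csv data):

import pandas as pd

s = pd.Series([2.0, 4.0, 6.0])
# s.min() and s.max() are scalars; pandas broadcasts them across the Series,
# normalising the whole column in one vectorised expression
normalised = (s - s.min()) / (s.max() - s.min())
print(normalised.tolist())  # [0.0, 0.5, 1.0]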

How to iterate over a CSV file with Pywikibot

I wanted to try uploading a series of items to test.wikidata, creating the item and then adding a statement of inception P571. The csv file sometimes has a date value, sometimes not. When no date value is given, I want to write out a placeholder 'some value'.
Imagine a dataframe like this (None marking the missing date):
df = {'Object': [1, 2, 3], 'Date': [250, None, 300]}
However, I am not sure how to iterate over a csv file with Pywikibot to create an item for each row and add a statement. Here is the code I wrote:
import pywikibot
import pandas as pd
site = pywikibot.Site("test", "wikidata")
repo = site.data_repository()
df = pd.read_csv('experiment.csv')
item = pywikibot.ItemPage(repo)
for item in df:
    date = df['date']
    prop_date = pywikibot.Claim(repo, u'P571')
    if date == '':
        prop_date.setSnakType('somevalue')
    else:
        target = pywikibot.WbTime(year=date)
        prop_date.setTarget(target)
    item.addClaim(prop_date)
When I run this through PAWS, I get the message: KeyError: 'date'
But I think the real issue here is that I am not sure how to get Pywikibot to iterate over each row of the dataframe and create a new claim for each new date value. I would value any feedback or suggestions for good examples and documentation. Many thanks!
Looking back on this, the solution was to use .iterrows() or .itertuples() or .loc[] to access the values in the row.
So
for row in df.itertuples():
    item = pywikibot.ItemPage(repo)  # create a fresh item for each row
    prop_date = pywikibot.Claim(repo, u'P571')
    if pd.isna(row.Date):  # empty CSV cells are read as NaN, not ''
        prop_date.setSnakType('somevalue')
    else:
        target = pywikibot.WbTime(year=int(row.Date))
        prop_date.setTarget(target)
    item.addClaim(prop_date)
Note that itertuples() exposes each column under its exact name, so with a 'Date' column the attribute is row.Date; the case mismatch in df['date'] is also what raised KeyError: 'date' earlier.
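For reference, the iteration pattern on its own (pandas only, no Wikidata calls, using the example dataframe from the question):

import pandas as pd

df = pd.DataFrame({'Object': [1, 2, 3], 'Date': [250, None, 300]})
for row in df.itertuples():
    if pd.isna(row.Date):  # missing date -> placeholder
        print(row.Object, '-> some value')
    else:
        print(row.Object, '->', int(row.Date))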

How to build a dataframe from scratch while filling in missing data? (details included in question)

I have a dataframe, named relevantdata in the code, with one row per Date/Key/Confirmed combination (shown as an image in the original post).
I want the dataframe transformed to a format with one row per Key and one column per date (also shown as an image).
Essentially, I want to get the relevant confirmed number for each Key for all the dates available in the dataframe. If a particular date is not available for a Key, that value should be zero.
Currently my code is as follows. (A try/except block is used because some Keys don't have the whole range of dates; a KeyError occurs the first time you refer to a missing date using countrydata.at[date,'Confirmed'] for the respective Key, and the except block then enters a zero into the dictionary for that date.)
import pandas

relevantdata = pandas.read_csv('https://raw.githubusercontent.com/open-covid-19/data/master/output/data_minimal.csv')
dates = relevantdata['Date'].unique().tolist()
covidcountries = relevantdata['Key'].unique().tolist()
data = dict()
data['Country'] = covidcountries
confirmeddata = relevantdata[['Date','Key','Confirmed']]
for country in covidcountries:
    for date in dates:
        countrydata = confirmeddata.loc[lambda confirmeddata: confirmeddata['Key'] == country].set_index('Date')
        try:
            if (date in data.keys()) == False:
                data[date] = list()
                data[date].append(countrydata.at[date,'Confirmed'])
            else:
                data[date].append(countrydata.at[date,'Confirmed'])
        except:
            if (date in data.keys()) == False:
                data[date].append(0)
            else:
                data[date].append(0)
finaldf = pandas.DataFrame(data = data)
While the above code accomplishes what I want, getting the dataframe into the format I require, it is way too slow, having to loop through every Key and date. I want to know if there is a better and faster method of doing the same without a nested for loop. Thank you for all your help.
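This long-to-wide reshape with zero-filling is what pivot_table does in a single vectorised call; a sketch against the same CSV, assuming one row per (Key, Date) pair:

import pandas

relevantdata = pandas.read_csv('https://raw.githubusercontent.com/open-covid-19/data/master/output/data_minimal.csv')
# one row per Key, one column per Date; missing (Key, Date) pairs become 0
finaldf = (relevantdata
           .pivot_table(index='Key', columns='Date', values='Confirmed', fill_value=0)
           .reset_index()
           .rename(columns={'Key': 'Country'}))
print(finaldf.head())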
