Convert string variables into ints in a dataset - python

I'm trying to convert values from strings to ints in a certain column of a dataset. I tried using a for loop, and even though the loop does seem to iterate through the data, it fails to convert any of the values. I'm certain I'm making a super basic mistake but can't figure it out, as I'm very new at this.
I downloaded a data file from https://www.kaggle.com/datasets/majunbajun/himalayan-climbing-expeditions
Then proceeded to process the data so that I can analyse it statistically.
Here's the start of the code
#import pandas
import pandas as pd
#import expeditions as csv file
exp = pd.read_csv('C:\\file\\path\\to\\expeditions.csv')
#create subset for success vs failure
exp_win_v_fail = exp[['termination_reason', 'basecamp_date', 'season']]
#drop successes in dispute
exp_win_v_fail = exp_win_v_fail[(exp_win_v_fail['termination_reason'] != 'Success (claimed)') & (exp_win_v_fail['termination_reason'] != 'Attempt rumoured')]
This is the part I can't figure out
#recode termination reason to be binary
for element in exp_win_v_fail['termination_reason']:
    if element == 'Success (main peak)':
        element = 1
    elif element == 'Success (subpeak)':
        element = 1
    else:
        element = 0
Any help would be very much appreciated

To replace all values beginning with 'Success' with 1, and all other values with 0:
from pandas import read_csv
RE = '^Success.*$'
NRE = '^((?!Success).)*$'
TR = 'termination_reason'
BD = 'basecamp_date'
SE = 'season'
data = read_csv('expeditions.csv')
exp_win_v_fail = data[[TR, BD, SE]]
for v, re_ in enumerate((NRE, RE)):
    exp_win_v_fail[TR] = exp_win_v_fail[TR].replace(to_replace=re_, value=v, regex=True)
for e in exp_win_v_fail[TR]:
    print(e)
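If regular expressions feel heavy for this, a vectorized alternative (a sketch, not tested against the actual Kaggle file) uses pandas' string methods and avoids both the loop and the regexes:

import pandas as pd

exp = pd.read_csv('expeditions.csv')
exp_win_v_fail = exp[['termination_reason', 'basecamp_date', 'season']].copy()

# drop disputed successes, as in the question
exp_win_v_fail = exp_win_v_fail[
    ~exp_win_v_fail['termination_reason'].isin(['Success (claimed)', 'Attempt rumoured'])
]

# str.startswith gives a boolean Series (na=False guards against missing values);
# astype(int) maps True/False to 1/0
exp_win_v_fail['termination_reason'] = (
    exp_win_v_fail['termination_reason'].str.startswith('Success', na=False).astype(int)
)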

Related

Looking for a more elegant and sophisticated solution when multiple ifs and for-loops are used

I am a beginner/intermediate Python user, and when I write elaborate code (at least for me), I always try to rewrite it to reduce the number of lines where possible.
Here is the code I have written.
It basically reads all values of one data frame looking for a specific string; if the string is found, it saves the index and value in a dictionary and drops the rows where that string was found. Then the same with the next string...
##### Reading CSV file values and looking for variants IDs ######
# Find Variant ID (rs000000) in CSV
# \d+ is neccesary in case the line find a rs+something. rs\d+ looks for rs+ numbers
##### Reading CSV file values and looking for variant IDs ######
# Find Variant ID (rs000000) in CSV
# \d+ is necessary in case the line finds rs+something. rs\d+ looks for rs + numbers
rs = df_draft[df_draft.apply(lambda x: x.str.contains(r"rs\d+"))].dropna(how='all').dropna(axis=1, how='all')
# Now, we save the results found in a dict: key=index and value=variant ID
if not rs.empty:
    ind = rs.index.to_list()
    vals = list(rs.stack().values)
    row2rs = dict(zip(ind, vals))
    print(row2rs)
    # We need to remove the rows where rs has been found,
    # because if more than one ID variant is found in the same row (i.e. rs# and NM_#)
    # this code would pick up the same variant more than once.
    for index, rs in row2rs.items():
        # Rows where the substring 'rs' has been found need to be deleted to avoid repetition
        # This will be done in df_draft
        df_draft = df_draft.drop(index)
## Same thing with the other ID variants
# Here with Variant ID (NM_0000000) in CSV
NM = df_draft[df_draft.apply(lambda x: x.str.contains(r"NM_\d+"))].dropna(how='all').dropna(axis=1, how='all')
if not NM.empty:
    ind = NM.index.to_list()
    vals = list(NM.stack().values)
    row2NM = dict(zip(ind, vals))
    print(row2NM)
    for index, NM in row2NM.items():
        df_draft = df_draft.drop(index)
# Here with Variant ID (NP_0000000) in CSV
NP = df_draft[df_draft.apply(lambda x: x.str.contains(r"NP_\d+"))].dropna(how='all').dropna(axis=1, how='all')
if not NP.empty:
    ind = NP.index.to_list()
    vals = list(NP.stack().values)
    row2NP = dict(zip(ind, vals))
    print(row2NP)
    for index, NP in row2NP.items():
        df_draft = df_draft.drop(index)
# Here with ClinVar field (RCV#) in CSV
RCV = df_draft[df_draft.apply(lambda x: x.str.contains(r"RCV\d+"))].dropna(how='all').dropna(axis=1, how='all')
if not RCV.empty:
    ind = RCV.index.to_list()
    vals = list(RCV.stack().values)
    row2RCV = dict(zip(ind, vals))
    print(row2RCV)
    for index, RCV in row2RCV.items():
        df_draft = df_draft.drop(index)
I was wondering about a more elegant way of writing this simple but long code.
I have been thinking of sa
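Since the four blocks differ only in the regex and the variable names, one way to compress them (a sketch, untested, assuming df_draft contains only strings as in the original) is to loop over a dict of patterns and collect each index-to-ID dict under its own key:

import pandas as pd

# patterns taken from the question's four blocks
patterns = {
    'rs': r"rs\d+",
    'NM': r"NM_\d+",
    'NP': r"NP_\d+",
    'RCV': r"RCV\d+",
}

results = {}  # e.g. results['rs'] will be the index -> variant ID dict
for name, pattern in patterns.items():
    found = (df_draft[df_draft.apply(lambda x: x.str.contains(pattern))]
             .dropna(how='all')
             .dropna(axis=1, how='all'))
    if not found.empty:
        results[name] = dict(zip(found.index.to_list(), found.stack().values))
        print(results[name])
        # drop the matched rows so later patterns don't pick them up again
        df_draft = df_draft.drop(found.index)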

Using pandas to find a string in a column

I am a very beginner in programming and trying to learn to code, so please bear with my bad coding. I am using pandas to find a string in a column (the Combinations column in the code below) of a data frame and print the entire row containing the string. Basically I need to find all the instances where the string occurs and print the entire row each time. I am not able to figure out how to find that particular instance in the column and print it. Find my code below.
import pandas as pd
data = pd.read_csv("signallervalues.csv",index_col=False)
data.head()
data['col1'] = data['col1'].astype(str)
data['col2'] = data['col2'].astype(str)
data['col3'] = data['col3'].astype(str)
data['col4'] = data['col4'].astype(str)
data['col5']= data['col5'].astype(str)
data.head()
combinations = data['col1'] + data['col2'] + data['col3'] + data['col4'] + data['col5']
data['combinations']= combinations
print(data.head())
list_of_combinations = data['combinations'].to_list()
print(list_of_combinations)
for i in list_of_combinations:
    if data['combinations'].str.contains(i).any():
        print(i + ' data occurs in row')
        # I need to print the row containing the string here
    else:
        print(i + ' is occurring only once')
my data frame looks like this
import pandas as pd
data=pd.DataFrame()
# recreating your data (more or less)
data['signaller']= pd.Series(['ciao', 'ciao', 'ciao'])
data['col6']= pd.Series(['-1-11-11', '11', '-1-11-11'])
list_of_combinations=['11', '-1-11-11']
data.reset_index(inplace=True)
# group by the values of column 6 and counting how many times they occur
g=data.groupby('col6')['index']
count= pd.DataFrame(g.count())
count=count.rename(columns={'index':'occurences'})
count.reset_index(inplace=True)
# create a df that keeps only the rows in the list 'list_of_combinations'
count[count['col6'].isin(list_of_combinations)]
My result is a frame with the col6 values that appear in list_of_combinations and how often each one occurs.
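To actually print the full rows containing a given string, which is what the question asks for, boolean indexing is enough. A minimal sketch against the toy frame above (it uses col6; swap in 'combinations' for the real data, and note that str.contains matches substrings, so '11' also matches inside '-1-11-11'):

for s in set(list_of_combinations):
    matches = data[data['col6'].str.contains(s, regex=False)]
    if len(matches) > 1:
        print(s + ' occurs in rows:')
        print(matches)
    else:
        print(s + ' is occurring only once')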

Script keeps showing "SettingWithCopyWarning"

Hello, my problem is that my script keeps showing the message below:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
downcast=downcast
I searched Google for a while regarding this, and it seems like my code is somehow assigning a sliced dataframe to a new variable, which is problematic.
The problem is **I can't find where my code gets problematic**.
I tried the copy function and separated the nested functions, but it is not working.
I attached my code below.
from operator import gt, lt  # needed for the ops lookup below

def case_sorting(file_get, col_get, methods_get, operator_get, value_get):
    ops = {">": gt, "<": lt}
    col_get = str(col_get)
    value_get = int(value_get)
    if methods_get == "|x|":  # '==' rather than 'is': 'is' tests identity, not string equality
        new_file = file_get[ops[operator_get](file_get[col_get], value_get)]
    else:
        new_file = file_get[ops[operator_get](file_get[col_get], np.percentile(file_get[col_get], value_get))]
    return new_file
Basically, what I was trying to do was to make a Flask API that gets an Excel file as input and returns a csv file with some filtering. So I defined some functions first.
def get_brandlist(df_input, brand_input):
    if brand_input == "default":
        final_list = (pd.unique(df_input["브랜드"])).tolist()
    else:
        final_list = brand_input.split("/")
        if '브랜드' in final_list:
            final_list.remove('브랜드')
    final_list = [x for x in final_list if str(x) != 'nan']
    return final_list
Then I defined the main function
def select_bestitem(df_data, brand_name, col_name, methods, operator, value):
    # // 2-1 // drop rows and columns that are entirely NA
    # (dropna takes one axis at a time, so rows and columns are dropped in two calls)
    df_data = df_data.dropna(axis=0, how='all').dropna(axis=1, how='all')
    df_data.fillna(method='pad', inplace=True)
    # // 2-2 // iterate over all rows to find which row contains the brand value
    default_number = 0
    for row in df_data.itertuples():
        if '브랜드' in row:
            df_data.columns = df_data.iloc[default_number, :]
            break
        else:
            default_number = default_number + 1
    # // 2-3 // create the list that contains all the target brand names
    brand_list = get_brandlist(df_input=df_data, brand_input=brand_name)
    # // 2-4 // subset the target brands into another dataframe
    df_data_refined = df_data[df_data.iloc[:, 1].isin(brand_list)]
    # // 2-5 // split the dataframe based on the brand name, and apply the input condition
    df_per_brand = {}
    df_per_brand_modified = {}
    for brand_each in brand_list:
        df_per_brand[brand_each] = df_data_refined[df_data_refined['브랜드'] == brand_each]
        file = df_per_brand[brand_each].copy()
        df_per_brand_modified[brand_each] = case_sorting(file_get=file, col_get=col_name, methods_get=methods,
                                                         operator_get=operator, value_get=value)
    # // 2-6 // merge all the remaining dataframes
    df_merged = pd.DataFrame()
    for brand_each in brand_list:
        df_merged = df_merged.append(df_per_brand_modified[brand_each], ignore_index=True)
    final_df = df_merged.to_csv(index=False, sep=',', encoding='utf-8')
    return final_df
And I am going to import this function in my app.py later.
I am quite new to coding, so I'm really sorry if my code is hard to understand, but I just really want to get rid of this annoying warning message. Thanks for the help in advance :)
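In case it helps: the warning fires when an in-place method is called on a DataFrame that pandas suspects is a view of another one, and slices like df_data_refined above are typical candidates. A minimal sketch of the pattern and the usual fix, using toy data rather than the actual Excel input:

import pandas as pd

df = pd.DataFrame({'a': [1, None, 3], 'b': [None, 5, 6]})

# A slice like this may be a view of df, so modifying it in place
# triggers SettingWithCopyWarning:
subset = df[df['a'] > 0]

# Taking an explicit copy makes the intent clear and silences the warning:
subset = df[df['a'] > 0].copy()
subset.fillna(method='pad', inplace=True)
print(subset)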

How to append data to a dataframe without overwriting?

I'm new to Python, but I need it for a personal project, and so I have this lump of code. Its function is to create a table and update it as necessary. The problem is that the table keeps being overwritten and I don't know why. I'm also struggling to correctly assign the starting position of the new lines to append; that's why total (which ends up overwritten as well) and pos are there, but I haven't figured out how to use them correctly. Any tips?
import datetime
import pandas as pd
import numpy as np
total ={}
entryTable = pd.read_csv("Entry_Table.csv")
newEntries = int(input("How many new entries?\n"))
for i in range(newEntries):
    ID = input("ID?\n")
    VQ = int(input("VQ?\n"))
    timeStamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    entryTable.loc[i] = [timeStamp, ID, VQ]
    entryTable.to_csv("Inventory_Table.csv")
    total[i] = 1
pos = sum(total.values())
print(pos)
inventoryTable = pd.read_csv("Inventory_Table.csv", index_col = 0)
Your variable i runs from 0 to newEntries - 1. When you add new data to row i of your Pandas dataframe, you overwrite the existing data in that row. If you want to add new data, use n + i, where n is the initial number of entries. You can determine n with either
n = len(entryTable)
or
n = entryTable.shape[0]
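Applied to the loop in the question, a sketch of that fix (untested against the real CSV files) might look like this:

n = len(entryTable)  # rows already present before this run
for i in range(newEntries):
    ID = input("ID?\n")
    VQ = int(input("VQ?\n"))
    timeStamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    entryTable.loc[n + i] = [timeStamp, ID, VQ]  # append after the existing rows instead of overwriting row i
entryTable.to_csv("Inventory_Table.csv")  # write once, after the loop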

My program computes values as strings and not as floats, even when I change the type

I have a problem with my program and I'm confused: I don't know why it won't change the type of the columns, or maybe it is changing the type and still treating the columns as strings. When I change the type to float and then multiply by 8, a value of 4, for example, gives me 44444444. Here is my code.
import pandas as pd
import re
import numpy as np
link = "excelfilett.txt"
file = open(link, "r")
frames = []
is_count_frames = False
for line in file:
    if "[Frames]" in line:
        is_count_frames = True
    if is_count_frames == True:
        frames.append(line)
    if "[EthernetRouting]" in line:
        break
number_of_rows = len(frames) - 3
header = re.split(r'\t', frames[1])
number_of_columns = len(header)
frame_array = np.full((number_of_rows, number_of_columns), 0)
df_frame_array = pd.DataFrame(frame_array)
df_frame_array.columns= header
for row in range(number_of_rows):
    frame_row = re.split(r'\t', frames[row+2])
    for position in range(len(frame_row)):
        df_frame_array.iloc[row, position] = frame_row[position]
df_frame_array['[MinDistance (ms)]'].astype(float)
df_frame_array.loc[:,'[MinDistance (ms)]'] *= 8
print(df_frame_array['[MinDistance (ms)]'])
but it gives me the value repeated 8 times, like (100100...100100). I also tried putting the values in a list:
MinDistList = df_frame_array['[MinDistance (ms)]'].tolist()
product = []
for i in MinDistList:
    product.append(i*8)
print(product)
but it still won't work, any ideas?
df_frame_array['[MinDistance (ms)]'].astype(float) doesn't change the column in place, but returns a new one.
You had the right idea, so just store it back:
df_frame_array['[MinDistance (ms)]'] = df_frame_array['[MinDistance (ms)]'].astype(float)
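If the column might also contain entries that don't parse as numbers, pd.to_numeric is a slightly more defensive alternative (errors='coerce' turns unparsable values into NaN instead of raising):

df_frame_array['[MinDistance (ms)]'] = pd.to_numeric(
    df_frame_array['[MinDistance (ms)]'], errors='coerce')
df_frame_array['[MinDistance (ms)]'] *= 8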
