using pandas to find the string from a column - python

I am a beginner in programming and still learning to code, so please bear with my code. I am using pandas to look for a string in one column of a data frame (the combinations column in the code below) and want to print every row that contains it. In other words, I need to find all the instances where the string occurs and print the entire row for each one. My code is below; I cannot figure out how to locate the matching rows and print them.
import pandas as pd
data = pd.read_csv("signallervalues.csv",index_col=False)
data.head()
data['col1'] = data['col1'].astype(str)
data['col2'] = data['col2'].astype(str)
data['col3'] = data['col3'].astype(str)
data['col4'] = data['col4'].astype(str)
data['col5']= data['col5'].astype(str)
data.head()
combinations = data['col1'] + data['col2'] + data['col3'] + data['col4'] + data['col5']
data['combinations']= combinations
print(data.head())
list_of_combinations = data['combinations'].to_list()
print(list_of_combinations)
for i in list_of_combinations:
    if data['combinations'].str.contains(i).any():
        print(i + ' data occurs in row')
        # I need to print the row containing the string here
    else:
        print(i + ' is occurring only once')
my data frame looks like this

import pandas as pd
data=pd.DataFrame()
# recreating your data (more or less)
data['signaller']= pd.Series(['ciao', 'ciao', 'ciao'])
data['col6']= pd.Series(['-1-11-11', '11', '-1-11-11'])
list_of_combinations=['11', '-1-11-11']
data.reset_index(inplace=True)
# group by the values of col6 and count how many times each one occurs
g = data.groupby('col6')['index']
count = pd.DataFrame(g.count())
count = count.rename(columns={'index': 'occurrences'})
count.reset_index(inplace=True)
# keep only the rows whose col6 value is in 'list_of_combinations'
count[count['col6'].isin(list_of_combinations)]
My result
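To print the full rows directly, as the question asks, one option is a boolean mask on the combinations column. The snippet below is only a sketch on made-up data: the column values and the `target` string are assumptions, not taken from signallervalues.csv.

```python
import pandas as pd

# Made-up stand-in for the CSV in the question (values are hypothetical).
data = pd.DataFrame({
    "signaller": ["a", "b", "c", "d"],
    "col1": [-1, 1, -1, 1],
    "col2": [1, 1, 1, -1],
})

# Build the combined string column the same way the question does.
data["combinations"] = data["col1"].astype(str) + data["col2"].astype(str)

# Rows whose combination appears more than once anywhere in the column.
repeated = data[data["combinations"].duplicated(keep=False)]
print(repeated)

# Rows that match one specific combination exactly.
target = "-11"  # hypothetical search string
matches = data[data["combinations"] == target]
print(matches)
```

For substring matches instead of exact ones, `data["combinations"].str.contains(target, regex=False)` can be used as the mask in the same way.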

Convert string variables into ints in a dataset

I'm trying to convert values from strings to ints in a certain column of a dataset. I tried using a for loop and even though the loop does seem to be iterating through the data it's failing to convert any of the variables. I'm certain that I'm making a super basic mistake but can't figure it out as I'm very new at this.
I downloaded a data file from https://www.kaggle.com/datasets/majunbajun/himalayan-climbing-expeditions
Then proceeded to process the data so that I can analyse it statistically.
Here's the start of the code
#import pandas
import pandas as pd
#import expeditions as csv file
exp = pd.read_csv('C:\\file\\path\\to\\expeditions.csv')
#create subset for success vs failure
exp_win_v_fail = exp[['termination_reason', 'basecamp_date', 'season']]
#drop successes in dispute
exp_win_v_fail = exp_win_v_fail[(exp_win_v_fail['termination_reason'] != 'Success (claimed)') & (exp_win_v_fail['termination_reason'] != 'Attempt rumoured')]
This is the part I can't figure out
#recode termination reason to be binary
for element in exp_win_v_fail['termination_reason']:
    if element == 'Success (main peak)':
        element = 1
    elif element == 'Success (subpeak)':
        element = 1
    else:
        element = 0
Any help would be very much appreciated
To replace all values beginning with 'Success' with 1, and all other values with 0:
from pandas import read_csv
RE = '^Success.*$'
NRE = '^((?!Success).)*$'
TR = 'termination_reason'
BD = 'basecamp_date'
SE = 'season'
data = read_csv('expeditions.csv')
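# copy the subset so the assignment below does not raise SettingWithCopyWarning
exp_win_v_fail = data[[TR, BD, SE]].copy()
# first pass maps values without 'Success' to 0, second pass maps 'Success...' to 1
for v, re_ in enumerate((NRE, RE)):
    exp_win_v_fail[TR] = exp_win_v_fail[TR].replace(to_replace=re_, value=v, regex=True)
for e in exp_win_v_fail[TR]:
    print(e)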
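As for why the loop in the question has no effect: assigning to `element` only rebinds the loop variable and never writes back to the data frame. A vectorised alternative is sketched below on a small made-up frame; the two non-success labels are placeholders, not values checked against the Kaggle file.

```python
import pandas as pd

# Small stand-in for the expeditions data (the last two labels are hypothetical).
exp = pd.DataFrame({
    "termination_reason": [
        "Success (main peak)",
        "Success (subpeak)",
        "Bad weather",
        "Accident",
    ]
})

# Rebinding the loop variable leaves the column untouched.
for element in exp["termination_reason"]:
    element = 1
unchanged = exp["termination_reason"].tolist()

# Vectorised recode: True/False from isin(), then cast to 1/0.
success_labels = ["Success (main peak)", "Success (subpeak)"]
exp["success"] = exp["termination_reason"].isin(success_labels).astype(int)
print(exp)
```

Writing the result to a new column keeps the original labels available for checking the recode.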

Set up a column based on another column and outside list in a Pandas Dataframe

I am trying to create a new column in a Pandas dataframe which takes only one array from a list of 5 arrays (the list is titled cluster_centre) and puts that array into the dataframe. It would take the array at the index that matches the value in the 'labels' column of the same dataframe (which has values of 0,1,2,3 or 4). So for instance, if the sentence in that row was given a label of 2 i.e. the 'labels' column value for that row would be 2, then the value of the 'cluster_centres' column in the df at that row would be cluster_centre[2]. How can I do this? The code I have attempted is pasted below:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import pandas as pd
with open('JWN_Nordstrom_MDNA_overview_2017.txt', 'r') as file:
    initial_corpus = file.read()
corpus = initial_corpus.split('. ')
# Extract sentence embeddings
embedder = SentenceTransformer('bert-base-wikipedia-sections-mean-tokens')
corpus_embeddings = embedder.encode(corpus)
# Perform KMeans clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
cluster_centre = clustering_model.cluster_centers_
# Create dataframe
All_data_df = pd.DataFrame()
All_data_df['sentences'] = corpus
All_data_df['embeddings'] = corpus_embeddings
All_data_df['labels'] = cluster_assignment
# The line below creates a ValueError
All_data_df['cluster_centres'] = cluster_centre[All_data_df['labels']]
print(All_data_df.head())
I get this error: ValueError: Wrong number of items passed 768, placement implies 1
UPDATE: I did some new stuff and tried this:
All_data_df = pd.DataFrame()
All_data_df['sentences'] = corpus
All_data_df['embeddings'] = corpus_embeddings
All_data_df['labels'] = cluster_assignment
#All_data_df['cluster_centres'] = 0
for index, row in All_data_df.iterrows():
    iforval = cluster_centre[row['labels']]
    All_data_df.at[index, 'cluster_centres'] = iforval
print(All_data_df.head())
But I get a new error: ValueError: Must have equal len keys and value when setting with an iterable. I printed iforval inside the loop and it does return 29 correct arrays from the cluster_centre list, which matches the 29 rows in the dataframe. Now I just need to put them into the new column, but .at[] didn't work; I'm not sure whether I am using it correctly.
EDIT/UPDATE: OK, I found a sort of solution. I don't know why I didn't realise this before: I just built a list beforehand and made that the new column, which ended up being much simpler.
cluster_centres_list = [cluster_centre[label] for label in cluster_assignment]
all_data_df = pd.DataFrame()
all_data_df['sentences'] = corpus
all_data_df['embeddings'] = corpus_embeddings
all_data_df['labels'] = cluster_assignment
all_data_df['cluster_centres'] = cluster_centres_list
print(all_data_df.head())
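The same idea can be written without the explicit comprehension by indexing the centres array with the labels array and splitting the result into one array per row. This is only a sketch: the random arrays below stand in for `clustering_model.cluster_centers_` and `clustering_model.labels_`, and the feature size of 4 is an assumption to keep the example small.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-ins for the KMeans outputs in the question (shapes are assumptions).
cluster_centre = rng.normal(size=(5, 4))        # 5 centres, 4 features each
cluster_assignment = np.array([2, 0, 4, 2, 1])  # one label per sentence

all_data_df = pd.DataFrame({"labels": cluster_assignment})

# Indexing with the labels gives an array of shape (n_rows, n_features).
# list(...) splits it into n_rows separate 1-D arrays, so each cell of the
# new column holds exactly one centre instead of pandas seeing many columns.
all_data_df["cluster_centres"] = list(cluster_centre[cluster_assignment])
print(all_data_df.head())
```

Passing the 2-D array directly is what triggers the "Wrong number of items passed" error, because pandas treats it as several columns rather than one.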

Pandas, Python, Excel, Bold a row using conditional formatting no solution

I am using python3 and pandas to create a script to:
Read unstructured xlsx data of varying column lengths
Total the "this", "last" and "diff" columns
Add Total under the brands columns
Dynamically bold the entire row that contains "total"
On the last point, the challenge I have been struggling with is that the row index of the total row changes depending on the data being fed into the script. The code provided does not have a solution to this issue. I have tried every variation I can think of using style.applymap(bold), with and without variables.
Example of input
input
Example of desired outcome
outcome
Script:
import pandas as pd
import io
import sys
import warnings
def bold(val):
    return 'font-weight: bold'
excel_file = 'testfile1.xlsx'
df = pd.read_excel(excel_file)
product = (df.loc[df['Brand'] == "widgit"])
# note: DataFrame.append was removed in pandas 2.0; pd.concat replaces it there
product = product.append({'Brand':'Total',
                          'This':product['This'].sum(),
                          'Last':product['Last'].sum(),
                          'Diff':product['Diff'].sum(),
                          '% Chg':product['This'].sum()/product['Last'].sum()
                          },
                         ignore_index=True)
product = product.append({'Brand':' '}, ignore_index=True)
product.fillna(' ', inplace=True)
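Try something like this:
import numpy as np
import pandas as pd
def highlight_max(x):
    # bold every cell whose value equals the value in row 4 of the same column
    return ['font-weight: bold' if v == x.loc[4] else ''
            for v in x]
df = pd.DataFrame(np.random.randn(5, 2))
df.style.apply(highlight_max)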
output:
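Since the position of the total row changes with the input, another option is to style by content rather than by index: apply a function row by row and bold the row whose Brand cell says "Total". The sketch below uses a made-up frame with the question's column names; `bold_total_row` and `style_totals` are hypothetical helper names, and the output file name in the comment is an assumption.

```python
import pandas as pd

def bold_total_row(row):
    """Return one CSS string per cell: bold if this is the total row."""
    is_total = str(row["Brand"]).strip().lower() == "total"
    style = "font-weight: bold" if is_total else ""
    return [style] * len(row)

def style_totals(df):
    """Build a Styler that bolds every row whose Brand is 'Total'."""
    # axis=1 makes Styler.apply pass each row to the function as a Series.
    return df.style.apply(bold_total_row, axis=1)

# Made-up data using the column names from the question.
product = pd.DataFrame({
    "Brand": ["widgit", "widgit", "Total"],
    "This": [10, 20, 30],
    "Last": [5, 15, 20],
    "Diff": [5, 5, 10],
})

# Show which rows would be bolded, wherever the total row ends up.
for _, row in product.iterrows():
    print(row["Brand"], bold_total_row(row))

# To write the styled sheet (needs jinja2 and openpyxl installed):
# style_totals(product).to_excel("styled.xlsx", engine="openpyxl", index=False)
```

Because the check looks at the cell value, it keeps working when more rows are added above the total.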

mask function doesn't get rid of unwanted data

I'm working on a data frame taken from Adafruit IO, and sadly some of my data is from a time when my project malfunctioned, so some of the values are just NaN.
I tried to remove them with these lines of code:
onlyValidData=temp_data.mask(temp_data['value'] =='NaN')
onlyValidData
This is data retrieved from an Adafruit IO feed and analyzed with pandas. I tried using the 'where' function too, but it didn't work.
My entire code is:
import pandas as pd
temp_data = pd.read_json('https://io.adafruit.com/api/(...)')
light_data = pd.read_json('https://io.adafruit.com/api/(...)')
temp_data['created_at'] = pd.to_datetime(temp_data['created_at'], infer_datetime_format=True)
temp_data = temp_data.set_index('created_at')
light_data['created_at'] = pd.to_datetime(light_data['created_at'], infer_datetime_format=True)
light_data = light_data.set_index('created_at')
tempVals = pd.Series(temp_data['value'])
lightVals = pd.Series(light_data['value'])
onlyValidData=temp_data.mask(temp_data['value'] =='NaN')
onlyValidData
The output is all of my data for some reason, but it should be only the valid values.
Hey, I think the issue here is that you're comparing against the string 'NaN', while actual NaN values aren't strings. They are missing-value markers, and NaN does not even compare equal to itself, so an equality test never finds them.
Try using:
onlyValidData = temp_data.mask(temp_data['value'].isnull())
Edit: to remove rows rather than marking all values in that row as NaN:
onlyValidData = temp_data.dropna()
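A small check on made-up readings shows the difference between the two tests. The numbers are invented, and only the 'value' column name comes from the question.

```python
import numpy as np
import pandas as pd

# Made-up readings with two missing values.
temp_data = pd.DataFrame({"value": [21.5, np.nan, 22.1, np.nan]})

# Comparing against the string 'NaN' matches nothing.
string_matches = int((temp_data["value"] == "NaN").sum())

# isna() finds the real missing values.
missing = int(temp_data["value"].isna().sum())

# Drop only the rows where 'value' itself is missing.
only_valid = temp_data.dropna(subset=["value"])
print(string_matches, missing, len(only_valid))
```

Using `subset=["value"]` avoids dropping rows that are only missing something in an unrelated column.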

pandas How to find all zeros in a column

import copy
head6 = copy.deepcopy(df)
closed_day = head6[["DATEn","COUNTn"]]\
    .groupby(head6['DATEn']).sum()
print(closed_day.head(10))
Output:
              COUNTn
DATEn
06-29-13    11326823
06-30-13     5667746
07-01-13     8694140
07-02-13     7275701
07-03-13     9948824
07-04-13  1072542591
07-05-13     7867611
07-06-13     4733018
07-07-13     4838404
07-08-13    42962814
Now what if I want to check whether COUNTn has any zeros and return the corresponding day? I've written something like this, but I'm getting an error saying my df doesn't have any column called COUNTn:
ndf = closed_day[["DATEn","COUNTn"]][closed_day.COUNTn == 0]
print(ndf.head(1))
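Judging by the printed output, the groupby moves DATEn into the index, so closed_day no longer has it as a regular column and the selection closed_day[["DATEn","COUNTn"]] fails. If you want to keep a flat data frame with both columns, as your code is expecting, use groupby(grouper, as_index=False) or call reset_index() on the result.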
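One way to find the zero-count days while keeping DATEn as an ordinary column is sketched below. The data is made up; only the column names DATEn and COUNTn come from the question.

```python
import pandas as pd

# Made-up rows using the column names from the question.
df = pd.DataFrame({
    "DATEn": ["06-29-13", "06-29-13", "06-30-13", "07-01-13"],
    "COUNTn": [5, 7, 0, 3],
})

# as_index=False keeps DATEn as a column instead of moving it to the index.
closed_day = df.groupby("DATEn", as_index=False)["COUNTn"].sum()

# Select the days whose total is zero.
zero_days = closed_day.loc[closed_day["COUNTn"] == 0, "DATEn"].tolist()
print(zero_days)
```

If the grouped frame already exists with DATEn as its index, `closed_day[closed_day["COUNTn"] == 0].index` returns the same days without regrouping.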
