Search for a string in the first 3 characters of a dataframe column - python

In this data frame, some values in the NOTES column start with "PRE". For those rows I want to write yes to a new column, otherwise no.
I was given this code for that, but it is not working.
import pandas as pd
df1 = pd.DataFrame({'NOTES': ["PREPAID_HOME_SCREEN_MAMO", "SCREEN_MAMO",
"> Unable to connect internet>4G Compatible>Set",
"No>Not Barred>Active>No>Available>Others>",
"Internet Not Working>>>Unable To Connect To"]})
df1['NOTES'].astype(str)
for i in df1['NOTES']:
    if i[:3]=='PRE':
        df1['new']='yes'
    else:
        df1['new']='No'
df1

Set df1['new'] to a list built with a list comprehension and a conditional (ternary) expression:
df1['new'] = ['yes' if i[:3] == 'PRE' else 'no' for i in df1['NOTES']]
Assigning a single string to a dataframe column sets every row to that string, so you need to assign a list with one value per row.
For a case-insensitive match:
df1['new'] = ['yes' if i[:3].upper() == 'PRE' else 'no' for i in df1['NOTES']]
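The same result can also be reached without a Python-level loop. The following is a minimal sketch, assuming NumPy is available alongside pandas; it swaps the list comprehension for Series.str.startswith plus numpy.where and is offered as an alternative, not as the approach the answer above describes. The variable name starts_with_pre is only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'NOTES': ["PREPAID_HOME_SCREEN_MAMO", "SCREEN_MAMO",
                              "> Unable to connect internet>4G Compatible>Set",
                              "No>Not Barred>Active>No>Available>Others>",
                              "Internet Not Working>>>Unable To Connect To"]})

# Boolean mask: True where the note starts with "PRE"
starts_with_pre = df1['NOTES'].astype(str).str.startswith('PRE')

# Turn the mask into 'yes'/'no' for all rows in one step
df1['new'] = np.where(starts_with_pre, 'yes', 'no')
print(df1)
```

For a case-insensitive check, calling .str.upper() on the column before .str.startswith('PRE') should work the same way.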

You can append the values to a list and then assign that list to the dataframe.
Code:
import pandas as pd
df1 = pd.DataFrame({'NOTES': ["PREPAID_HOME_SCREEN_MAMO", "SCREEN_MAMO",
"> Unable to connect internet>4G Compatible>Set",
"No>Not Barred>Active>No>Available>Others>",
"Internet Not Working>>>Unable To Connect To"]})
# astype returns a new Series, so assign it back for the conversion to take effect
df1['NOTES'] = df1['NOTES'].astype(str)
data = []
for i in df1['NOTES']:
    if i[:3]=='PRE':
        data.append('yes')
    else:
        data.append('no')
df1['new'] = data

The code that you posted overwrites every value in the 'new' column on each iteration, because assigning a single string to df1['new'] sets the whole column at once. After the loop, every row holds the value chosen for the last row.
Try the following:
import pandas as pd
df1 = pd.DataFrame({'NOTES': ...)
df1['NOTES'] = df1['NOTES'].astype(str)
# Placeholder list with one entry per row
new=['*' for i in range(len(df1['NOTES']))]
for i in range(len(df1['NOTES'])):
    if df1['NOTES'][i][0:3]=="PRE":
        new[i]='Yes'
    else:
        new[i]='No'
df1['new']=new

Related

Filter pandas DataFrame column based on multiple conditions returns empty dataframe

I am having trouble filtering data based on multiple conditions.
[dataframe image][1]
[1]: https://i.stack.imgur.com/TN9Nd.png
When I filter it on multiple conditions, I get an empty DataFrame.
user_ID_existing = input("Enter User ID:")
print("Available categories are:\n Vehicle\tGadgets")
user_Category_existing = str(input("Choose from the above category:"))
info = pd.read_excel("Test.xlsx")
data = pd.DataFrame(info)
df = data[((data.ID == user_ID_existing) & (data.Category == user_Category_existing))]
print(df)
If I replace the variables user_ID_existing and user_Category_existing with literal values, I get the rows. I even tried with numpy and still only get an empty dataframe:
filtered_values = np.where((data['ID'] == user_ID_existing) & (data['Category'].str.contains(user_Category_existing)))
print(filtered_values)
print(data.loc[filtered_values])
input always returns a string, but pandas reads the ID column with a numeric dtype, so filtering it by a string gives you an empty dataframe.
You need to use int to convert the ID entered by the user to a number.
Try this:
user_ID_existing = int(input("Enter User ID:"))
print("Available categories are:\n Vehicle\tGadgets")
user_Category_existing = input("Choose from the above category:")
data = pd.read_excel("Test.xlsx")
df = data[(data["ID"].eq(user_ID_existing))
          & (data["Category"].eq(user_Category_existing))].copy()
print(df)
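The dtype mismatch is easy to reproduce on a small frame. The sketch below uses made-up rows as a stand-in for Test.xlsx (the actual file contents are not shown in the question), and user_id_text is a hypothetical name for the value input() would return.

```python
import pandas as pd

# Hypothetical stand-in for the contents of Test.xlsx
data = pd.DataFrame({"ID": [101, 102, 103],
                     "Category": ["Vehicle", "Gadgets", "Vehicle"]})

user_id_text = "102"  # input() always hands back a string like this

# Comparing an integer column with a string matches no rows
as_string = data[data["ID"] == user_id_text]

# After converting to int, the matching row is found
as_int = data[data["ID"] == int(user_id_text)]

print(data["ID"].dtype)
print(len(as_string), len(as_int))
```

Checking data["ID"].dtype first is a quick way to tell which conversion the user input needs.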

How to delete rows in a CSV file based on blank columns

I have a CSV file in the format below. It has thousands of rows, so I have summarized it like this:
id,name,score1,score2,score3
1,,3.0,4.5,2.0
2,,,,
3,,4.5,3.2,4.1
I have tried to use .dropna() but that is not working.
My desired output is
id,name,score1,score2,score3
1,,3.0,4.5,2.0
3,,4.5,3.2,4.1
All I would really need is to check if score1 is empty because if score1 is empty then the rest of the scores are empty as well.
I have also tried this but it doesn't seem to do anything.
import pandas as pd
df = pd.read_csv('dataset.csv')
df.drop(df.index[(df["score1"] == '')], axis=0, inplace=True)
df.to_csv('new.csv')
Can anyone help with this?
After seeing your edits, I realized that dropna doesn't work for you because every one of your rows has a missing value (the name column is empty everywhere), so a plain dropna() removes all of them. To filter on NaN values in a specific column, I would recommend using the apply function as in the following code. (By the way, StackOverflow.csv is just a file where I copied and pasted the data from your question.)
import pandas as pd
import math
df = pd.read_csv("StackOverflow.csv", index_col="id")
# Function that takes a number and returns whether it is not NaN
def not_nan(number):
    return not math.isnan(number)
# Filtering the dataframe with the function
df = df[df["score1"].apply(not_nan)]
What this does is iterate through the score1 column and check whether each value is NaN. If it is, the function returns False; otherwise it returns True. We then use the resulting Series of True and False values to keep only the rows where score1 is present.
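A shorter route to the same filter, sketched here as an alternative to the apply approach above, is the subset parameter of dropna, which restricts the NaN check to the listed columns. The CSV text is the sample from the question, read from a string instead of a file.

```python
import pandas as pd
from io import StringIO

csv_text = """id,name,score1,score2,score3
1,,3.0,4.5,2.0
2,,,,
3,,4.5,3.2,4.1
"""

# Empty fields are read as NaN, not as empty strings
df = pd.read_csv(StringIO(csv_text))

# Drop only the rows where score1 is missing; the empty name column is ignored
cleaned = df.dropna(subset=["score1"])
print(cleaned)
```

Because read_csv turns empty fields into NaN, a comparison such as df["score1"] == '' never matches, which would explain why the drop call in the question appeared to do nothing.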
import pandas as pd
df = pd.DataFrame([[1,3.0,4.5,2.0],[2],[3,4.5,3.2,4.1]], columns=["id","score1","score2","score3"])
aux1 = df.dropna()
aux2 = df.dropna(axis='columns')
aux3 = df.dropna(axis='rows')
print('=== original ===')
print(df)
print()
print('=== mode 1 ===')
print(aux1)
print()
print('=== mode 2 ===')
print(aux2)
print()
print('=== mode 3 ===')
print(aux3)
print()
print('=== mode 4 ===')
print('drop original')
df.dropna(axis=1,inplace=True)
print(df)

Compare entire rows for equality if some condition is satisfied

Let's say I have the following data of a match in a CSV file:
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
I'm writing a Python program. Somewhere in my program I have the scores collected for a match stored in a list, say x = [1,0,4]. Using pandas, I have found where in the data these scores exist, and I can print "found" or "not found". However, I want my code to print the name these scores correspond to. In this case the program should output "Charlie", since Charlie has all these values [1,0,4]. How can I do that?
I will have a large set of data so I must be able to tell which name corresponds to the numbers I pass to the program.
Yes, here's how to compare entire rows in a dataframe:
df[(df == x).all(axis=1)].index # where x is the pd.Series we're comparing to
Also, it makes life easiest if you directly set name as the index column when you read in the CSV.
import pandas as pd
from io import StringIO
df = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(df), index_col='name')
x = pd.Series({'match1':1, 'match2':0, 'match3':4})
Now you can see that doing df == x, or equivalently df.eq(x), is not quite what you want, because it compares element-wise and returns a dataframe of True/False values. So you need to aggregate each row with .all(axis=1), which finds the rows where all comparison results were True...
df.eq(x).all(axis=1)
df[ (df == x).all(axis=1) ]
# match1 match2 match3
# name
# Charlie 1 0 4
...and then finally since you only want the name of such rows:
df[ (df == x).all(axis=1) ].index
# Index(['Charlie'], dtype='object', name='name')
df[ (df == x).all(axis=1) ].index.tolist()
# ['Charlie']
which is what you wanted. (I only added the spaces inside the expression for clarity).
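Since the question starts from a plain list rather than a pd.Series, the comparison can be wrapped in a small helper. This is only a sketch: names_matching is a hypothetical function name, and it assumes the list is given in the same order as the dataframe's columns.

```python
import pandas as pd
from io import StringIO

def names_matching(df, scores):
    """Return the index labels of rows whose values equal `scores`.

    Assumes `scores` is listed in the same order as df's columns.
    """
    # Label the list with the column names so the comparison aligns by column
    target = pd.Series(scores, index=df.columns)
    return df.index[df.eq(target).all(axis=1)].tolist()

csv_text = """name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
"""
df = pd.read_csv(StringIO(csv_text), index_col="name")

result = names_matching(df, [1, 0, 4])
print(result)
```

An empty list comes back when no row matches, which covers the "not found" case from the question.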
You need to use DataFrame.loc which would work like this:
print(df.loc[(df.match1 == 1) & (df.match2 == 0) & (df.match3 == 4), 'name'])
Maybe try something like this:
import pandas as pd
import numpy as np
# Makes sample data
match1 = np.array([2,2,1])
match2 = np.array([4,4,0])
match3 = np.array([3,3,4])
name = np.array(['Alice','Bob','Charlie'])
df = pd.DataFrame({'name': name, 'match1': match1, 'match2':match2, 'match3' :match3})
df
# example of the list you want to get the data from
x=[1,0,4]
#x=[2,4,3]
# should return the name Charlie as well as the index (based on the values in the list x)
df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])]
# Makes a new dataframe out of the above
mydf = pd.DataFrame(df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])])
# Loop that prints out the name based on the position in mydf
# (if more than one name matches, it prints all of them; if only one matches, it prints only that one)
for i in range(0, len(mydf)):
    print(mydf['name'].iloc[i])
You can use this. Here data is your DataFrame (change the name to match yours), and the values [1,0,4] are assumed to be of int type:
data = data[(data['match1'] == 1) & (data['match2'] == 0) & (data['match3'] == 4)].index
print(data[0])
If the columns are of object (string) type, use this instead:
data = data[(data['match1'] == "1") & (data['match2'] == "0") & (data['match3'] == "4")].index
print(data[0])

Python remove everything after specific string and loop through all rows in multiple columns in a dataframe

I have a file full of URL paths like below spanning across 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I have defined as "string1", and I would like to loop through all 4 columns in the dataframe "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
    # Objective: Replace "&FolderCTID", delete all string after
    string1 = "&FolderCTID"
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I searched and found similar solutions, which are the ones used above, but none of them work.
Can anyone shed some light on this? Any assistance is appreciated.
Below are a few example rows from columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it: first declare a variable with your target columns, then use stack() and str.split to get your target output, and finally unstack and reapply the output to your original df.
cols_to_slice = ['ColumnA','ColumnB','ColumnC','ColumnD']
string1 = "&FolderCTID"
# [0] keeps the part before string1; [1] would keep the part after it
df[cols_to_slice].stack().str.split(string1,expand=True)[0].unstack(1)
If you want to replace these columns in your target df, then simply do:
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1,expand=True)[0].unstack(1)
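You should first get the position of the string in each value:
# End position of string1 in each value (keeps string1 in the result)
indexes = df_MasterData[i].str.find(string1) + len(string1)
# If you don't want string1 in the result, use its start position instead
indexes = df_MasterData[i].str.find(string1)
.str[:n] only accepts a single integer, not a Series of positions, so slice each value with its own position (str.find returns -1 when string1 is absent; this assumes the column has no missing values):
df_MasterData[i] = [s[:n] if n != -1 else s for s, n in zip(df_MasterData[i], indexes)]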
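For comparison, here is a minimal sketch of the split-and-keep-first-part idea (Method 1 from the question) run on its own, without the other two methods overwriting the column afterwards. The URLs are shortened, made-up examples and only two columns are used; the real data would come from the Excel file.

```python
import pandas as pd

string1 = "&FolderCTID"

# Shortened, made-up URLs standing in for the real data
df_MasterData = pd.DataFrame({
    "Column_A": ["https://example.com/a?RootFolder=x&FolderCTID=0x01&View=1",
                 "https://example.com/b?RootFolder=y"],
    "Column_B": ["https://example.com/c?RootFolder=z&FolderCTID=0x02&View=2",
                 "https://example.com/d?RootFolder=w&FolderCTID=0x03"],
})

cols = ["Column_A", "Column_B"]
for col in cols:
    # Split once at string1 and keep only the part before it;
    # values without string1 are left unchanged
    df_MasterData[col] = df_MasterData[col].str.split(string1, n=1).str[0]

print(df_MasterData)
```

In the question's loop the three methods run one after another on the same column, so Methods 2 and 3 operate on the already-trimmed output of Method 1, which may be why the result looked wrong.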

using pandas to find the string from a column

I am a beginner in programming and trying to learn to code, so please bear with my bad code. I am using pandas to find a string in a column (the combinations column in the code below) of the data frame and print the entire row containing that string. Basically, I need to find all the instances where the string occurs and print the entire row for each. My code is below. I am not able to figure out how to find those particular rows and print them.
import pandas as pd
data = pd.read_csv("signallervalues.csv",index_col=False)
data.head()
data['col1'] = data['col1'].astype(str)
data['col2'] = data['col2'].astype(str)
data['col3'] = data['col3'].astype(str)
data['col4'] = data['col4'].astype(str)
data['col5']= data['col5'].astype(str)
data.head()
combinations= data['col1']+data['col2'] + data['col3'] + data['col4'] + data['col5']
data['combinations']= combinations
print(data.head())
list_of_combinations = data['combinations'].to_list()
print(list_of_combinations)
for i in list_of_combinations:
    if data['combinations'].str.contains(i).any():
        print(i + ' data occurs in row')
        # I need to print the row containing the string here
    else:
        print(i + ' is occurring only once')
my data frame looks like this
import pandas as pd
data=pd.DataFrame()
# recreating your data (more or less)
data['signaller']= pd.Series(['ciao', 'ciao', 'ciao'])
data['col6']= pd.Series(['-1-11-11', '11', '-1-11-11'])
list_of_combinations=['11', '-1-11-11']
data.reset_index(inplace=True)
# group by the values of column 6 and counting how many times they occur
g=data.groupby('col6')['index']
count= pd.DataFrame(g.count())
count=count.rename(columns={'index':'occurences'})
count.reset_index(inplace=True)
# create a df that keeps only the rows in the list 'list_of_combinations'
count[count['col6'].isin(list_of_combinations)]
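To print the full rows whose combination occurs more than once, which is what the question asks for, one option is Series.duplicated with keep=False. This is a sketch on made-up data; the signaller and combinations values are invented for illustration.

```python
import pandas as pd

# Made-up rows; in the question the combinations column is built from col1..col5
data = pd.DataFrame({
    "signaller": ["a", "b", "c", "d"],
    "combinations": ["-1-11-11", "11", "-1-11-11", "1-1"],
})

# keep=False marks every occurrence of a repeated value, not only the later ones
repeated = data[data["combinations"].duplicated(keep=False)]
print(repeated)

# How many times each combination occurs
counts = data["combinations"].value_counts()
print(counts)
```

This compares whole values for equality, whereas the str.contains loop in the question matches substrings, so "11" would also count as occurring inside "-1-11-11" there.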
