Hello, my goal is to find which circuit (Nom_ci) each row corresponds to.
I can't find the right path; I'm trying to find the right method.
I had done it with a chain of if/elif statements, but the run times were enormous.
Can you help me find the best method?
Thanks in advance.
import pandas as pd
import numpy as np
import re
cycling = pd.DataFrame(
{
'Comp_ci': [1, 2, 3, 3, 3, 3, 3, 2, 1, 1],
'Nom_ci': ['RONCQ_A2_OPTI_SRV_S3',
'RONCQ_A3_SRV_S3, RONCQ_A2_OPTI_SRV_S3',
'RONCQ_A2_TEMP_SRV_S3, RONCQ_A3_SRV_S3, RONCQ_A2_OPTI_SRV_S3',
'RONCQ_A2_SRV_PC_S3, RONCQ_A2_TEMP_SRV_S3, RONCQ_A3_SRV_S3',
'RONCQ_A2_PC_SRV_S3, RONCQ_A2_SRV_S3, RONCQ_A2_TEMP_SRV_S3',
'RONCQ_A2_OPTI_SRV_S3, RONCQ_A2_PC_SRV_S3, RONCQ_A2_SRV_S3',
'RONCQ_A3_SRV_S3, RONCQ_A2_OPTI_SRV_S3, RONCQ_A2_PC_SRV_S3',
'RONCQ_A2_TEMP_SRV_S3, RONCQ_A3_SRV_S3',
'RONCQ_A2_SRV_S3',
'RONCQ_A2_PC_SRV_S3'],
'result hope': ['autre', 'RONCQ_A3_VSR_S3', 'RONCQ_A3_VSR_S3', 'RONCQ_A3_VSR_S3',
                'RONCQ_A2_VSR_S3', 'RONCQ_A2_VSR_S3', 'RONCQ_A3_VSR_S3',
                'RONCQ_A3_VSR_S3', 'RONCQ_A2_VSR_S3', 'autre']
}
)
print(cycling)
condition = ((cycling['Comp_ci'] == 1) &
             ~cycling['Nom_ci'].str.contains('_OPTI') &
             ~cycling['Nom_ci'].str.contains('_TEMP') &
             ~cycling['Nom_ci'].str.contains('_PC'))
cycling['col3'] = np.where(condition, cycling['Nom_ci'], 'autre')
print(cycling)
EDIT:
OK, I think I have understood what you are trying to achieve. Is this it?
temp = cycling.Nom_ci.str.split(', +') # split on a comma followed by one or more spaces (regex)
print(temp)
print('-'*50)
temp = temp.explode() # explode the lists into one Series (note that the indexes are kept untouched)
print(temp)
print('-'*50)
temp = temp.to_frame() # convert the Series to a DataFrame
print(type(temp))
print('-'*50)
temp['match'] = ~temp['Nom_ci'].str.contains('_TEMP|_PC|_OPTI') # boolean Series (regex) that is True for names matching none of your patterns, which lets you select the desired strings
print(temp)
print('-'*50)
temp = temp[temp.match] # select the rows matching your criteria (note that the indexes are still untouched)
print(temp)
print('-'*50)
temp.rename({'Nom_ci':'col3'}, axis=1, inplace=True) #rename your column to whatever you want
print(temp)
print('-'*50)
temp.drop('match', inplace=True, axis=1) #drop the "match" column which is now useless
print(temp)
print('-'*50)
cycling = cycling.join(temp) #join the dataframes based on indexes
print(cycling)
print('-'*50)
cycling['col3'] = cycling['col3'].fillna('autre') # fill the NaN values with 'autre'
print(cycling)
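For reference, the same steps can be chained into a more compact form. This is a sketch under the same assumptions (the column names above, and at most one name per row surviving the filter; groupby(level=0).first() guards against duplicate indexes if that ever breaks):
exploded = cycling['Nom_ci'].str.split(', ').explode()
keep = exploded[~exploded.str.contains('_TEMP|_PC|_OPTI')]
cycling['col3'] = keep.groupby(level=0).first().reindex(cycling.index).fillna('autre')
print(cycling)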
I have 5 different data frames that are outputs of different conditions or tables.
I want an output telling me whether these data frames are empty or not. Basically I will check len(df) for each data frame and produce a string for those that have anything in them.
def check(df1, df2, df3, df4, df5):
    parts = []
    if len(df1) > 0:
        parts.append("df1 not empty")
    if len(df2) > 0:
        parts.append("df2 not empty")
    # ... and so on for df3, df4, df5
    return ", ".join(parts)
Then I want to append these strings to each other to get a string like
**df1 not empty, df3 not empty**
Try this:
import pandas as pd
dfs = {'milk': pd.DataFrame(['a']), 'bread': pd.DataFrame(['b']), 'potato': pd.DataFrame()}
print(''.join(f'{name} not empty. ' for name, df in dfs.items() if not df.empty))
output:
milk not empty. bread not empty.
data = [1,2,3]
df = pd.DataFrame(data, columns=['col1']) #create a non-empty df
data1 = []
df1 = pd.DataFrame(data1) #create an empty df
dfs = [df, df1] #list them
#the "for loop" is replaced here by a list comprehension
#I used enumerate() to give each df in the list an index, because otherwise printing df0 or df1 directly would print the entire DataFrame, not just its name
print(' '.join([f'df{i} is not empty.' for i,df in enumerate(dfs) if not df.empty]))
Result:
df0 is not empty.
With a one-liner:
dfs = [df1,df2,df3,df4,df5]
output = ["your string here" for df in dfs if not df.empty]
You can then concatenate strings together, if you want:
final_string = "; ".join(output)
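If you want the final string to name the frames, as in the desired output above, naming them in a dict works. A minimal sketch, assuming df1 through df5 already exist:
dfs = {'df1': df1, 'df2': df2, 'df3': df3, 'df4': df4, 'df5': df5}
final_string = ', '.join(f'{name} not empty' for name, df in dfs.items() if not df.empty)
print(final_string)  # e.g. df1 not empty, df3 not empty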
Let's say I have the following data of a match in a CSV file:
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
I'm writing a Python program. Somewhere in my program I have the scores collected for a match stored in a list, say x = [1,0,4]. I have found where these scores exist in the data using pandas, and I can print "found" or "not found". However, I want my code to print out which name these scores correspond to. In this case the program should output "Charlie", since Charlie has all these values [1,0,4]. How can I do that?
I will have a large set of data, so I must be able to tell which name corresponds to the numbers I pass to the program.
Yes, here's how to compare entire rows in a dataframe:
df[(df == x).all(axis=1)].index # where x is the pd.Series we're comparing to
Also, it makes life easiest if you directly set name as the index column when you read in the CSV.
import pandas as pd
from io import StringIO
df = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(df), index_col='name')
x = pd.Series({'match1':1, 'match2':0, 'match3':4})
Now you can see that doing df == x, or equivalently df.eq(x), is not quite what you want, because it does an element-wise compare and returns a row of True/False values. So you need to aggregate those rows with .all(axis=1), which finds the rows where all comparison results were True...
df.eq(x).all(axis=1)
df[ (df == x).all(axis=1) ]
# match1 match2 match3
# name
# Charlie 1 0 4
...and then finally since you only want the name of such rows:
df[ (df == x).all(axis=1) ].index
# Index(['Charlie'], dtype='object', name='name')
df[ (df == x).all(axis=1) ].index.tolist()
# ['Charlie']
which is what you wanted. (I only added the spaces inside the expression for clarity).
You need to use DataFrame.loc which would work like this:
print(df.loc[(df.match1 == 1) & (df.match2 == 0) & (df.match3 == 4), 'name'])
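If the number of match columns can vary, the same .loc idea can be built programmatically. A sketch, assuming 'name' is a regular column and x lists the scores in column order:
import numpy as np
cols = ['match1', 'match2', 'match3']
x = [1, 0, 4]
mask = np.logical_and.reduce([df[c] == v for c, v in zip(cols, x)])  # AND of the per-column comparisons
print(df.loc[mask, 'name'].tolist())  # ['Charlie']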
Maybe try something like this:
import pandas as pd
import numpy as np
# Makes sample data
match1 = np.array([2,2,1])
match2 = np.array([4,4,0])
match3 = np.array([3,3,4])
name = np.array(['Alice','Bob','Charlie'])
df = pd.DataFrame({'name': name, 'match1': match1, 'match2': match2, 'match3': match3})
df
# example of the list you want to get the data from
x=[1,0,4]
#x=[2,4,3]
# should return the name Charlie as well as the index (based on the values in the list x)
df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])]
# Makes a new dataframe out of the above
mydf = pd.DataFrame(df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])])
# Loop that prints out the name(s) based on the index of mydf
# (if several names match, it will print them all; if only one matches, just that one)
for i in range(len(mydf)):
    print(mydf['name'].iloc[i])
You can use this. Here data is your DataFrame (change the name to match yours), and this assumes the [1, 0, 4] values are int type:
data = data[(data['match1'] == 1) & (data['match2'] == 0) & (data['match3'] == 4)].index
print(data[0])
If the columns are object (string) type, then use this:
data = data[(data['match1'] == "1") & (data['match2'] == "0") & (data['match3'] == "4")].index
print(data[0])
I am trying to delete rows in a pandas data frame whose 'MEL' column has a shape other than (99, 13).
path MEL word
0 8d37d10e7f97ddea2eca9d39a4cf821b4457b041.wav [[-10.160675, -13.804866, 0.9188097, 4.415375,... one
1 9a8f761be3fa0d0a963f5612ba73e68cc0ad11ba.wav [[-10.482644, -13.339122, -3.4994812, -5.29343... one
2 314cdc39f628bc68d216498b2080bcc7a549a45f.wav [[-11.076196, -13.980294, -17.289637, -41.0668... one
3 cc499e63eee4a3bcca48b5b452df04990df83570.wav [[-13.830213, -12.64104, -3.7780707, -10.76490... one
4 38cdcc4d9432ce4a2fe63e0998dbca91e64b954a.wav [[-11.967776, -23.27864, -10.3656, -8.786977, ... one
I have tried the following:
indexNames = merged[ merged['MEL'].shape != (99,13) ].index
merged.drop(indexNames , inplace=True)
The first line of code, however, gives me KeyError: True. Does anyone have an idea how to make this happen?
The condition
merged['MEL'].shape != (99,13)
evaluates to a single True or False, because .shape here is the shape of the whole 'MEL' column (a Series), not the shape of each array stored in it. Indexing the DataFrame with that single boolean is what raises KeyError: True.
Please note that you may select rows from a DataFrame using a boolean vector the same length as the DataFrame’s index (for example, something derived from one of the columns of the DataFrame). More here: https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
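A minimal illustration of that kind of boolean indexing (a toy example, not the MEL data):
import pandas as pd
s = pd.Series([10, 20, 30])
print(s[[True, False, True]])  # keeps rows 0 and 2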
EDIT: This code might help
# generate sample dataset
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'col1': [np.random.rand(3,2), np.random.rand(5,2), np.random.rand(7,8), np.random.rand(5,2)],
                        'col2': ['b','a','b','q'],
                        'col3': ['c','c','c','q'],
                        'col4': ['d','d','d','q'],
                        'col5': ['e','e','a','q']})
# drop the rows where 'col1' does not have the target shape
for index in df.index:
    if df.loc[index, 'col1'].shape != (5,2):
        df.drop(index, inplace=True)
EDIT2: Without a loop:
df = pd.DataFrame(data = {'col1': [np.random.rand(3,2),np.random.rand(5,2),np.random.rand(7,8),np.random.rand(5,2)],
'col2': ['b','a','b','q'],
'col3': ['c','c','c','q'],
'col4': ['d','d','d','q'],
'col5': ['e','e','a','q'] })
df['shapes'] = [x.shape for x in df.col1.values]
df = df[df['shapes'] == (5,2)].drop('shapes', axis=1)  # keep only the rows with the target shape
... In other words, you want all the rows where the column 'MEL' has the shape (99, 13). Note that merged['MEL'].shape is the shape of the column itself, so the element shapes have to be evaluated per row. I would do
my_desired_df = merged[merged['MEL'].apply(lambda a: a.shape) == (99, 13)]
You need to get a series of the shapes
df['MEL'].apply(lambda x: x.shape)
Then you can test this to get a boolean series
df['MEL'].apply(lambda x: x.shape) == (99,13)
And then index with the boolean series
new_df = df.loc[df['MEL'].apply(lambda x: x.shape) == (99,13), :]
This will give you everything that matches your shape. It's probably easier to do it this way than to play with df.drop(), but you could do that too with
correct = df['MEL'].apply(lambda x: x.shape) == (99,13)
new_df = df.drop(correct[~correct].index)
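A hypothetical quick check, with toy arrays standing in for the MEL spectrograms:
import numpy as np
import pandas as pd
df = pd.DataFrame({'MEL': [np.zeros((99, 13)), np.zeros((50, 13)), np.ones((99, 13))]})
new_df = df.loc[df['MEL'].apply(lambda x: x.shape) == (99, 13), :]
print(len(new_df))  # 2 -- rows 0 and 2 survive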
I have the following example of my dataframe:
df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
'cust_num': [1, 2, 1],
'Title': ['philips', 'samsung', 'philips']})
If the cust_num is equal in both rows,
the Title is equal for both rows in the dataframe, and
the second_date in one row is <= the end_date in the other row,
then the value True should be appended to a new column in the original row.
Because I'm working with a big dataset I'm looking for an efficient way to do this.
In this case only the first record should get a True value.
I have looked at apply with lambda and at groupby in pandas, but I couldn't find a way to make these work.
Try this (off the top of my head I cannot come up with a faster method):
import pandas as pd
import numpy as np
df["second_date"]=pd.to_datetime(df["second_date"], format='%d-%m-%Y')
df["end_date"]=pd.to_datetime(df["end_date"], format='%d-%m-%Y')
df["new col"] = False
for cust in set(df["cust_num"]):
indices = df.index[df["cust_num"] == cust].tolist()
if len(indices) > 1:
sub_df = df.loc[indices]
for title in set(df.loc[indices]["Title"]):
indices_title = sub_df.index[sub_df["Title"] == title]
if len(indices_title) > 1:
for i in indices_title:
if sub_df.loc[indices_title]["second_date"][i] <= sub_df.loc[indices_title]["end_date"][i]:
df["new col"] = True
break
df["new_col"] = new_col
First you need to make the date columns comparable with each other by casting them to datetime. Then create the additional column, initialized to False.
Now create a set of all unique customer numbers and iterate through it. For each customer number, get a list of all row indices with that customer number. If this list is longer than 1, you have several rows with the same customer number, so create a sub-DataFrame of those rows. Then iterate through the set of titles in that sub-DataFrame. If a title occurs more than once, check each of those rows: write True into the additional column for a row whenever some other row in the group has a second_date on or before that row's end_date.
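Since the question asks for efficiency, here is a vectorized sketch of the same logic (under the same date-direction assumption as the loop above): self-merge on cust_num and Title, drop self-pairs, then evaluate the date condition for all remaining pairs at once.
pairs = df.reset_index().merge(df.reset_index(), on=['cust_num', 'Title'], suffixes=('', '_other'))
pairs = pairs[pairs['index'] != pairs['index_other']]  # drop each row paired with itself
hits = pairs.loc[pairs['second_date_other'] <= pairs['end_date'], 'index'].unique()
df['new col'] = df.index.isin(hits)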
This should work too. Also, while reading the comments, I am assuming that all cust_num values are unique.
import pandas as pd
df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
'cust_num': [1, 2, 1],
'Title': ['philips', 'samsung', 'philips']})
df["second_date"]=pd.to_datetime(df["second_date"])
df["end_date"]=pd.to_datetime(df["end_date"])
df['Value'] = False
for i in range(len(df)):
    for j in range(len(df)):
        if i != j:
            if df.loc[j, 'end_date'] >= df.loc[i, 'second_date']:
                if df.loc[i, 'cust_num'] == df.loc[j, 'cust_num']:
                    if df.loc[i, 'Title'] == df.loc[j, 'Title']:
                        df.loc[i, 'Value'] = True
Tell me if this code works, and report any errors!
My goal
I'm struggling with creating a subset of a dataframe based on the content of the categorical variable S11AQ1A20. In all the how-tos I came across, the categorical variable contained string data, but in my case it's integer values that have a specific meaning (1 = Yes, 0 = No, 9 = Unknown). Therefore, I added categories to let pandas label the values properly.
Ideally, Case A and Case B in the sample code below would each contain 5 rows after the subsetting is done. But currently it only works if I don't label the integer values.
What I have figured out so far
Case B shows that the subsetting isn't performed as expected as soon as categories are added with the following line:
df.S11AQ1A20 = df.S11AQ1A20.cat.rename_categories(['Yes', 'No', 'Unknown'])
Sample Dataset
The sample dataset (nesarc_short.csv) used for testing can be found here: https://pastebin.com/NkTeBsDR
Example Code:
import pandas as pd
dataset_path = 'nesarc_short.csv'
print('CASE A: NUMERICAL -> working\n')
df = pd.read_csv(dataset_path, low_memory=False, na_values=' ')
print("A: Rows before: " + str(len(df.S11AQ1A20))) # Outputs: 100
df = df[(df.S11AQ1A20 == 1)]
print("A: Rows after: " + str(len(df.S11AQ1A20))) # Outputs: 5
###############################################################
print('\nCASE B: CATEGORICAL -> Not working\n')
df = pd.read_csv(dataset_path, low_memory=False, dtype={ 'S11AQ1A20' : 'category' }, na_values=' ')
# If this is commented out, the subsetting works but no labels will be available
df.S11AQ1A20 = df.S11AQ1A20.cat.rename_categories(['Yes', 'No', 'Unknown'])
print("B: Rows before: " + str(len(df.S11AQ1A20))) # Outputs: 100
df = df[(df.S11AQ1A20 == 'YES') | (df.S11AQ1A20 == '1') | (df.S11AQ1A20 == 1)]
print("B: Rows after: " + str(len(df.S11AQ1A20))) # Outputs: 0
Console output
CASE A: NUMERICAL -> working
A: Rows before: 100
A: Rows after: 5
CASE B: CATEGORICAL -> Not working
B: Rows before: 100
B: Rows after: 0
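A likely fix, sketched under two assumptions (not confirmed in the thread): the filter fails because the renamed labels are case-sensitive strings ('Yes', not 'YES'), and because rename_categories with a list assigns labels by the existing category order, which for string categories read from the CSV is the lexically sorted '0', '1', '9'. Passing a dict makes the mapping explicit:
df = pd.read_csv(dataset_path, low_memory=False, dtype={'S11AQ1A20': 'category'}, na_values=' ')
df.S11AQ1A20 = df.S11AQ1A20.cat.rename_categories({'1': 'Yes', '0': 'No', '9': 'Unknown'})
df = df[df.S11AQ1A20 == 'Yes']
print("B (fixed): Rows after: " + str(len(df)))  # expected: 5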