I have a NUM column and I am trying to filter the rows where NUM is valid (isValid returns True), and then:
Update the current dataframe
Insert the count of wrong rows into the dict report
I tried this:
report["NUM"] = dataset['NUM'].apply(~isValid).count()
but it does not work for me.
Dataframe is:
NUM AGE COUNTRY
1 18 USA
2 19 USA
3 30 AU
isValid is a function:
def isValid(value):
    return True
Remark:
I use this rule:
report["NUM"] = (~dataset['NUM'].apply(checkNumber)).sum()
I get this warning:
report["NUM"] = (~dataset['NUM'].apply(luhn)).sum()
C:\Users\Oleh\AppData\Local\Temp\ipykernel_17284\2678582562.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
If you want to count the rows where isValid outputs False:
(~dataset['NUM'].apply(isValid)).sum()
output: 0
edit
m = dataset['NUM'].apply(isValid)
report["NUM"] = (~m).sum()
dataset2 = dataset[m]
You need
dataset[dataset['NUM'].map(isValid) == False].count()
Because
dataset['NUM'].apply(~isValid)
is just wrong.
isValid is a function; ~isValid tries to apply the bitwise-not operator to the function object itself, which raises TypeError: bad operand type for unary ~: 'function'.
Also
dataset[col].apply(func)
will return a Series holding the value the function returns for each row, not a filtered dataframe. If you want to filter out the False ones you need the
df[df[col]==True]
syntax. If you had a new column say
df["valid"] = dataset[col].map(func)
You could then do
df.query("valid is False")
Or something of the sort
import pandas as pd

def isValid(value):
    return True

my_df = pd.DataFrame({'NUM': [1, 2, 3], 'AGE': [18, 19, 20], 'COUNTRY': ['USA', 'USA', 'AU']})
report = {'wrong_rows': (~my_df.NUM.apply(isValid)).sum()}
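For completeness, here is a minimal end-to-end sketch of both steps; the isValid rule below is a made-up stand-in (the real check, e.g. a Luhn test, would go in its place):

import pandas as pd

def isValid(value):
    return value in (1, 2)  # hypothetical rule, purely for illustration

dataset = pd.DataFrame({'NUM': [1, 2, 3],
                        'AGE': [18, 19, 30],
                        'COUNTRY': ['USA', 'USA', 'AU']})

mask = dataset['NUM'].apply(isValid)   # True for valid rows
report = {}
report['NUM'] = (~mask).sum()          # count of wrong rows -> 1
dataset = dataset[mask].copy()         # keep only the valid rows; copy() avoids
                                       # the SettingWithCopyWarning on later writes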
Related
I'm really amateur-level with both python and pandas, but I'm trying to solve an issue for work that's stumping me.
I have two dataframes, let's call them dfA and dfB:
dfA:
project_id Category Initiative
10
20
30
40
dfB:
project_id field_id value
10 100 lorem
10 200 lorem1
10 300 lorem2
20 200 ipsum
20 300 ipsum1
20 500 ipsum2
Let's say I know "Category" from dfA correlates to field_id "100" from dfB, and "Initiative" correlates to field_id "200".
I need to look through dfB and for a given project_id/field_id combination, take the corresponding value in the "value" column and place it in the correct cell in dfA.
The result would look like this:
dfA:
project_id  Category  Initiative
10          lorem     lorem1
20                    ipsum
30
40
Bonus difficulty: not every project in dfA exists in dfB, and not every field_id is used in every project_id.
I hope I've explained this well enough; I feel like there must be a relatively simple way to handle this that I'm missing.
You could do something like this, although it's not very elegant; there must be a better way. I had to use try/except for the cases where the project_id is not available in dfB. I put NaN values for the missing ones, but you can easily put empty strings.
import numpy as np

def get_value(row):
    try:
        res = dfB[(dfB['field_id'] == 100) & (dfB['project_id'] == row['project_id'])]['value'].iloc[0]
    except IndexError:
        res = np.nan
    row['Category'] = res
    try:
        res = dfB[(dfB['field_id'] == 200) & (dfB['project_id'] == row['project_id'])]['value'].iloc[0]
    except IndexError:
        res = np.nan
    row['Initiative'] = res
    return row

dfA = dfA.apply(get_value, axis=1)
EDIT: as mentioned in the comments, this is not very flexible because some values are hardcoded, but you can easily change that with something like the below. This way, if the field_ids change or you need to add/remove a column, just update the dictionary.
columns_fields = {"Category": 100, "Initiative": 200}

def get_value(row):
    for key, value in columns_fields.items():
        try:
            res = dfB[(dfB['field_id'] == value) & (dfB['project_id'] == row['project_id'])]['value'].iloc[0]
        except IndexError:
            res = np.nan
        row[key] = res
    return row

dfA = dfA.apply(get_value, axis=1)
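A possibly neater alternative, sketched here under the assumption that each project_id/field_id pair appears at most once in dfB, is to pivot dfB so the field_ids become columns and then left-merge onto dfA; projects missing from dfB simply end up with NaN. The frames are rebuilt with the sample values so the snippet runs on its own:

import numpy as np
import pandas as pd

dfA = pd.DataFrame({'project_id': [10, 20, 30, 40],
                    'Category': np.nan, 'Initiative': np.nan})
dfB = pd.DataFrame({'project_id': [10, 10, 10, 20, 20, 20],
                    'field_id': [100, 200, 300, 200, 300, 500],
                    'value': ['lorem', 'lorem1', 'lorem2',
                              'ipsum', 'ipsum1', 'ipsum2']})

columns_fields = {"Category": 100, "Initiative": 200}

# one row per project_id, one column per field_id
lookup = (dfB.pivot(index='project_id', columns='field_id', values='value')
             .rename(columns={v: k for k, v in columns_fields.items()}))

dfA = (dfA.drop(columns=list(columns_fields))
          .merge(lookup[list(columns_fields)],
                 left_on='project_id', right_index=True, how='left'))
print(dfA)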
How do you search whether a value exists in a specific row?
For example, I have this file which contains the following:
ID Name
1 Mark
2 John
3 Mary
If the user inputs 1, it will
print("the value already exists.")
But if the user inputs 4, it will add a new row containing 4 and
name = input('Name')
and update the file like this:
ID Name
1 Mark
2 John
3 Mary
4 (userinput)
An easy approach would be:
import pandas as pd

bool_val = False
for i in range(0, df.shape[0]):
    if str(df.iloc[i]['ID']) == str(input_str):
        # the ID is already present, so do not append
        bool_val = False
        break
    else:
        print("there")
        bool_val = True

if bool_val == True:
    df = df.append(pd.Series([input_str, name], index=['ID', 'Name']), ignore_index=True)
Remember to add the parameter ignore_index to avoid TypeError. I added a bool value to avoid appending a row multiple times.
searchid=20 #use sys.argv[1] if needed to be passed as argument to the program. Or read it as raw_input
if str(searchid) in df.index.astype(str):
    print("ID found")
else:
    name=raw_input("ID not found. Specify the name for this ID to update the data:") #use input() if python version >= 3
    df.loc[searchid]=[str(name)]
If ID is not the index:
if str(searchid) in df.ID.values.astype(str):
    print("ID found")
else:
    name=raw_input("ID not found. Specify the name for this ID to update the data:") #use input() if python version >= 3
    df.loc[searchid]=[str(searchid),str(name)]
Specifying the column headers during the update might avoid mismatch errors:
df.loc[searchid]={'ID': str(searchid), 'Name': str(name)}
This should help
Also see https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html, which mentions that append and concat inherently copy the full dataframe.
df.loc[&lt;ID value&gt;] will return the row containing that ID, assuming the IDs are the index values of the df you are referring to.
If you have a list of IDs and wish to search for them all together then:
assuming:
listofids=['ID1','ID2','ID3']
df.loc[listofids]
will yield the rows containing the above IDs
If IDs are not in index then:
Assuming df['ids'] contains the given IDs:
'searchedID' in df.ids.values
will return True or False based on presence or absence
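Putting the pieces together, here is a minimal, self-contained sketch of the check-then-append flow. Note that DataFrame.append was deprecated and then removed in pandas 2.0, so this sketch uses pd.concat instead; the column names simply mirror the example above:

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Mark', 'John', 'Mary']})

user_id = input('ID: ')
if user_id in df['ID'].astype(str).values:
    print("the value already exists.")
else:
    name = input('Name: ')
    new_row = pd.DataFrame({'ID': [int(user_id)], 'Name': [name]})
    df = pd.concat([df, new_row], ignore_index=True)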
This is my first time posting a question, so take it easy on me if I don't know the Stack Overflow norms for asking questions.
Attached is a snippet of what I am trying to accomplish on my side-project. I want to be able to compare a user input with a database .xlsx file that was imported by pandas.
I want to compare the user input with the database column ['Component']; if that component is there, it should grab the properties associated with it.
comp_loc = r'C:\Users\ayubi\Documents\Python Files\Chemical_Database.xlsx'
data = pd.read_excel(comp_loc)
print(data)
LK = input('What is the Light Key?: ') #Answer should be Benzene in this case
if LK == data['Component'].any():
    Tcrit = data['TC, (K)']
    Pcrit = data['PC, (bar)']
    A = data['A']
    B = data['B']
    C = data['C']
    D = data['D']
else:
    print('False')
Results
  Component  TC, (K)  PC, (bar)      A      B      C      D
0   Benzene    562.2       48.9 -6.983  1.332 -2.629 -3.333
1   Toluene    591.8       41.0 -7.286  1.381 -2.834 -2.792
What is the Light Key?: Benzene
False
Please let me know if you have any questions.
I do appreciate your help!
You can do this by taking advantage of indices and using the df.loc accessor in pandas:
# set index to Component column for convenience
data = data.set_index('Component')

LK = input('What is the Light Key?: ') #Answer should be Benzene in this case

# check if your lookup is in the index
if LK in data.index:
    # grab the row by the index using .loc
    row = data.loc[LK]
    # if the column name has spaces, you need to access as key
    Tcrit = row['TC, (K)']
    Pcrit = row['PC, (bar)']
    # if the column name doesn't have a space, you can access as attribute
    A = row.A
    B = row.B
    C = row.C
    D = row.D
else:
    print('False')
This is a great case for an Index. Set 'Component' to the Index, and then you can use a very fast loc call to look up the data. Instead of the if-else use a try-except as a KeyError is going to tell you that the LK doesn't exist, without requiring the slower check of first checking whether it's in the index.
I also highly suggest you keep the values as a single Series instead of having them float around as 6 different variables. It's simple to access each item by the Series index, e.g. Series['A'].
df = df.set_index('Component')

def grab_data(df, LK):
    try:
        return df.loc[LK]
    except KeyError:
        return False

grab_data(df, 'Benzene')
#TC, (K)     562.200
#PC, (bar)    48.900
#A            -6.983
#B             1.332
#C            -2.629
#D            -3.333
#Name: Benzene, dtype: float64

grab_data(df, 'foo')
#False
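As a small follow-up usage sketch (the variable names here are just illustrative), the returned Series can be checked and unpacked like this:

row = grab_data(df, LK)
if row is False:
    print('False')
else:
    Tcrit = row['TC, (K)']
    Pcrit = row['PC, (bar)']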
I use Python and I have a dataset of 35,000 rows. I need to change values with a loop, but it takes too much time.
PS: I have columns named succes_1, succes_2, succes_5, succes_7 ... succes_120, so I get the name of the column from the other loop, and the values depend on the other column.
Example:
SK_1  Sk_2  Sk_5  ...  SK_120  Succes_1  Succes_2  ...  Succes_120
1     0     1          0       1         0               0
1     1     0          1       2         1               1
for i in range(len(data_jeux)):
    for d in range(len(succ_len)):
        ids = succ_len[d]
        if data_jeux['SK_%s' % ids][i] == 1:
            data_jeux.iloc[i]['Succes_%s' % ids] = 1+i
Is there a faster way to do this? I tried:
data_jeux.values[i, ('Succes_%s' % ids)] = 1+i
but it returns an error, maybe because it doesn't accept a string index.
You can define columns and then use loc to increment. It's not clear whether your columns are naturally ordered; if they aren't you can use sorted with a custom function. String-based sorting will cause '20' to come before '100'.
def splitter(x):
    return int(x.rsplit('_', maxsplit=1)[-1])
cols = df.columns
sk_cols = sorted(cols[cols.str.startswith('SK')], key=splitter)
succ_cols = sorted(cols[cols.str.startswith('Succes')], key=splitter)
df.loc[df[sk_cols] == 1, succ_cols] += 1
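If that loc line doesn't behave as expected on your pandas version (boolean-DataFrame row indexers are finicky), here is a hedged NumPy-based sketch of the same idea. It assumes the SK_* and Succes_* columns pair up one-to-one after sorting (and that the prefixes match the real column casing), and that the original loop's intent is to write 1 + i, the positional row number, into each matching Succes column:

import numpy as np

cols = data_jeux.columns
sk_cols = sorted(cols[cols.str.startswith('SK')], key=splitter)
succ_cols = sorted(cols[cols.str.startswith('Succes')], key=splitter)

# mask[i, j] is True where the j-th SK column equals 1 in row i
mask = data_jeux[sk_cols].to_numpy() == 1

# the original loop writes 1 + i (the positional row index)
row_vals = np.arange(1, len(data_jeux) + 1)[:, None]

# write 1 + i into the paired Succes column wherever its SK column is 1
data_jeux[succ_cols] = np.where(mask, row_vals, data_jeux[succ_cols].to_numpy())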
First post here. I am trying to find the total count of values in an Excel file. After importing the file, I need to count all the values except 0 and, wherever there is a 0, make that cell blank.
df6 = df5.append(df5.ne(0).sum().rename('Final Value'))
I tried the above, but it is not working properly: it counts the column names as well, and I only need to count the float values.
Demo DataFrame:
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
GSM95473 0.08277 0.00874 0.00363 0.01877
GSM95474 0.09503 0.00592 0.00352 0
GSM95475 0.08486 0.00678 0.00386 0.01973
GSM95476 0.08105 0.00913 0.00306 0.01801
GSM95477 0.00000 0.00812 0.00428 0
GSM95478 0.07615 0.00777 0.00438 0.01799
GSM95479 0 0.00508 1 0
GSM95480 0.08499 0.00442 0.00298 0.01897
GSM95481 0.08893 0.00734 0.00204 0
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
These are the column names and index values, which need to be ignored when counting.
The output should look like this after counting:
Final 8 9 9 5
If you just need the count (note that this approach changes the values in your dataframe), you could apply a function to each cell in your DataFrame with the applymap method. First create a function to check for a float:
def floatcheck(value):
    if isinstance(value, float):
        return 1
    else:
        return 0
Then apply it to your dataframe:
df6 = df5.applymap(floatcheck)
This will create a dataframe with a 1 if the value is a float and a 0 if not. Then you can apply your sum method:
df7 = df6.append(df6.sum().rename("Final Value"))
I was able to solve the issue, So here it is:
df5 = df4.append(pd.DataFrame(dict(((df4[1:] != 1) & (df4[1:] != 0)).sum()), index=['Final']))
df5.columns = df4.columns
went = df5.to_csv("output3.csv")
What I did was change the starting index so I didn't count the first row, which was alphanumeric, and then I just compared the values.
Thanks for your response.
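For reference, here is a rough vectorized sketch of the same idea, assuming (as in the demo) that the first row of df5 holds the probe names and everything below it is numeric:

import pandas as pd

# convert the data rows to numbers; anything non-numeric becomes NaN
body = df5.iloc[1:].apply(pd.to_numeric, errors='coerce')

# count the values that are present and not 0, per column
counts = (body.notna() & body.ne(0)).sum()
counts.name = 'Final'

# blank out the zeros and append the count row
result = pd.concat([df5.iloc[:1],
                    body.where(body.ne(0), ''),
                    counts.to_frame().T])
result.to_csv("output3.csv")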