How to append dataframe column name in the list? - python

I am new to pandas.
I am trying to append column names whose correlation is greater than zero to a list.
Here is my code:
corr_matrix = df_train.corr()
corr_matrix["failure"].sort_values(ascending=False)
useful_features = []
for f in corr_matrix["failure"]:
    if f > 0:
        useful_features.append(df_train.columns)
print(useful_features)
But this appends all the column names to the list:
[Index(['id', 'product_code', 'loading', 'attribute_0', 'attribute_1',
'attribute_2', 'attribute_3', 'measurement_0', 'measurement_1',
'measurement_2', 'measurement_3', 'measurement_4', 'measurement_5',
'measurement_6', 'measurement_7', 'measurement_8', 'measurement_9',
'measurement_10', 'measurement_11', 'measurement_12', 'measurement_13',
'measurement_14', 'measurement_15', 'measurement_16', 'measurement_17',
'failure', 'kfold'],
...
(I am not pasting the complete output.)
What I want is
useful_features = ['failure','loading',...,'kfold']
Output of
corr_matrix["failure"].sort_values(ascending=False)
failure 1.000000
loading 0.129089
measurement_17 0.033905
measurement_5 0.018079
measurement_8 0.017119
measurement_7 0.016787
measurement_2 0.015808
measurement_6 0.014791
measurement_0 0.009646
attribute_2 0.006337
measurement_14 0.006211
measurement_12 0.004398
measurement_3 0.003577
measurement_16 0.002237
kfold 0.000130
measurement_10 -0.001515
measurement_13 -0.001831
measurement_15 -0.003544
measurement_9 -0.003587
measurement_11 -0.004801
id -0.007545
measurement_4 -0.010488
measurement_1 -0.010810
attribute_3 -0.019222
Name: failure, dtype: float64
Is there any way to append just the column names?
df_train.columns.values also appends all the names to the list.

You can use indexing to do this:
print(corr_matrix.index[corr_matrix["failure"] > 0])
This translates to:
Get the index from the correlation matrix.
Evaluate where the "failure" column is > 0.
Use that boolean evaluation to filter the index.
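To get the result as a plain list like useful_features = ['failure', 'loading', ...], you can chain .tolist() onto the filtered index. A minimal sketch, using a few made-up correlation values in the shape of the question's output:

```python
import pandas as pd

# Hypothetical data standing in for df_train's correlation matrix
corr_matrix = pd.DataFrame(
    {"failure": [1.0, 0.129089, -0.007545, 0.033905]},
    index=["failure", "loading", "id", "measurement_17"],
)

# Boolean-mask the index, then convert it to a plain Python list
useful_features = corr_matrix.index[corr_matrix["failure"] > 0].tolist()
print(useful_features)  # ['failure', 'loading', 'measurement_17']
```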

Related

Use a value from one dataframe to lookup the value in another and return an adjacent cell value and update the first dataframe value

I have 2 datasets (dataframes), one called source and the other crossmap. I am trying to find rows with a specific column value starting with "999"; if one is found, I need to look up the complete value of that column (e.g. "99912345") in the crossmap dataframe and return the value from a column on that row in the crossmap.
# Source Dataframe
0 1 2 3 4
------ -------- -- --------- -----
0 303290 544981 2 408300622 85882
1 321833 99910722 1 408300902 85897
2 323241 99902978 3 408056001 95564
# Cross Map Dataframe
ID NDC ID DIN(NDC) GTIN NAME PRDID
------- ------ -------- -------------- ---------------------- -----
44563 321833 99910722 99910722000000 SALBUTAMOL SULFATE (A) 90367
69281 321833 99910722 99910722000000 SALBUTAMOL SULFATE (A) 90367
6002800 323241 99902978 75402850039706 EPINEPHRINE (A) 95564
8001116 323241 99902978 99902978000000 EPINEPHRINE (A) 95564
The 'straw dog' logic I am working with is this:
search the source file and find '999' entries in column 1
df_source[df_source['Column1'].str.contains('999')]
iterate through the rows returned, search for the column-1 value in the crossmap dataframe column DIN(NDC), and return the corresponding PRDID
update the source dataframe with the PRDID, and write the updated file
It is these last two logic pieces where I am struggling. I appreciate any direction/guidance anyone can provide.
Is there maybe a better/easier means of doing this using Python without pandas/dataframes?
So, as far as I understand: we are looking for values starting with 999 in the first column of the Source Dataframe. Next, we find these values in the Cross Map column 'DIN(NDC)' and get the values of the column 'PRDID' on those rows.
If everything is correct, then I can't understand what your further actions are?
import pandas as pd
import more_itertools as mit

Cross_Map = pd.DataFrame({'DIN(NDC)': [99910722, 99910722, 99902978, 99902978],
                          'PRDID': [90367, 90367, 95564, 95564]})
df = pd.DataFrame({0: [303290, 321833, 323241], 1: [544981, 99910722, 99902978],
                   2: [2, 1, 3], 3: [408300622, 408300902, 408056001],
                   4: [85882, 85897, 95564]})

m = [i for i in df[1] if str(i)[:3] == '999']  # find the values in column 1
index = list(mit.locate(list(Cross_Map['DIN(NDC)']), lambda x: x in m))  # get the indexes of the matched DIN(NDC) values
print(Cross_Map['PRDID'][index])
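Pandas' own merge can also cover the last two steps (lookup plus update) in one call. A sketch under the assumption that column 1 of the source holds the DIN values, with frames cut down from the question's samples:

```python
import pandas as pd

# Hypothetical frames mirroring the question's source and cross-map data
source = pd.DataFrame({0: [303290, 321833, 323241],
                       1: [544981, 99910722, 99902978]})
crossmap = pd.DataFrame({'DIN(NDC)': [99910722, 99902978],
                         'PRDID': [90367, 95564]})

# Keep only rows whose column-1 value starts with "999",
# then merge against the cross-map to pull in PRDID
mask = source[1].astype(str).str.startswith('999')
matched = source[mask].merge(crossmap, left_on=1, right_on='DIN(NDC)', how='left')
print(matched['PRDID'].tolist())  # [90367, 95564]
```

Duplicate DIN(NDC) rows in the cross-map would multiply matches, so deduplicating it first (as above) keeps one PRDID per source row.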

pandas get the min/max value of a row in a dataframe of only those rows that contain a certain string in another column

I feel really stupid now, this should be easy.
I got good help here how-to-keep-the-index-of-my-pandas-dataframe-after-normalazation-json
I need to get the min/max value in the column 'price' only where the value in the column 'type' is buy/sell. Ultimately I want to get back the 'id' also for that specific order.
So first of I need the price value and second I need to get back the value of 'id' corresponding.
You can find the dataframe that I'm working with in the link.
What I can do is find the min/max value of the whole column 'price' like so :
x = df['price'].max() # = max price
and I can sort out all the "buy" type like so:
d = df[['type', 'price']].value_counts(ascending=True).loc['buy']
but I still can't do both at the same time.
You have to use the .loc method on the dataframe in order to filter by type.
import pandas as pd
data = {"type":["buy","other","sell","buy"], "price":[15,222,11,25]}
df = pd.DataFrame(data)
buy_and_sell = df.loc[df['type'].isin(["sell","buy"])]
min_value = buy_and_sell['price'].min()
max_value = buy_and_sell['price'].max()
min_rows = buy_and_sell.loc[buy_and_sell['price']==min_value]
max_rows = buy_and_sell.loc[buy_and_sell['price']==max_value]
min_rows and max_rows can contain multiple rows, because it is possible that the same min price is repeated.
To extract the index just use .index.
hbid = df.loc[df.type == 'buy'].min()[['price', 'txid']]
gives me the lowest value of price and the lowest value of txid, not the id that belongs to the order with the lowest price. Any help or tips would be greatly appreciated!
0 OMG4EA-Z2WUP-AQJ2XU None ... buy 0.00200000 XBTEUR # limit 14600.0
1 OBTJMX-WTQSU-DNEOES None ... buy 0.00100000 XBTEUR # limit 14700.0
2 OAULXQ-3B5WJ-LMLSUC None ... buy 0.00100000 XBTEUR # limit 14800.0
[3 rows x 23 columns]
highest buy order =
14800.0
here the id and price . . txid =
price 14600.0
txid OAULXQ-3B5WJ-LMLSUC
I'm still not sure how your isin line works — buy_and_sell is not specified ;)
How I did it:
I first found the highest buy, then found the 'txid' for that price, then removed the index from the returned series. Finally I had to strip a whitespace before my string; no idea how it got there.
def get_highest_sell_txid():
    hs = df.loc[df.type == 'sell', :].max()['price']
    hsid = df.loc[df.price == hs, :]
    xd = hsid['txid']
    return xd.to_string(index=False)

xd = get_highest_sell_txid()
sd = xd.strip()
cancel_order = 'python -m krakenapi CancelOrder txid=' + sd
subprocess.run(cancel_order)
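An alternative to the function above that avoids the string round-trip and the strip: Series.idxmax returns the index label of the maximum, so the matching txid can be read from that exact row. A sketch with shortened, made-up txids:

```python
import pandas as pd

# Hypothetical order book with the columns the question mentions
df = pd.DataFrame({'txid': ['OMG4EA', 'OBTJMX', 'OAULXQ'],
                   'type': ['buy', 'buy', 'sell'],
                   'price': [14600.0, 14700.0, 14800.0]})

# idxmax gives the row label of the highest-priced buy,
# so price and txid come from the same order
buys = df[df['type'] == 'buy']
best_buy = buys.loc[buys['price'].idxmax()]
print(best_buy['txid'], best_buy['price'])  # OBTJMX 14700.0
```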

Search for column in pandas

How do you search if a value exists in a specific row?
Example I have this file which contains the following:
ID Name
1 Mark
2 John
3 Mary
The user will input 1 and it will
print("the value already exist.")
But if the user inputs 4, it will add a new row containing 4 and
name = input('Name')
and update the file like this
ID Name
1 Mark
2 John
3 Mary
4 (userinput)
An easy approach will be:
import pandas as pd

bool_val = False
for i in range(0, df.shape[0]):
    if str(df.iloc[i]['ID']) == str(input_str):
        bool_val = False
        break
    else:
        print("there")
        bool_val = True

if bool_val == True:
    df = df.append(pd.Series([input_str, name], index=['ID', 'Name']), ignore_index=True)
Remember to add the parameter ignore_index to avoid a TypeError. I added a bool value to avoid appending a row multiple times.
searchid = 20  # use sys.argv[1] if it needs to be passed as an argument to the program, or read it as raw_input
if str(searchid) in df.index.astype(str):
    print("ID found")
else:
    name = raw_input("ID not found. Specify the name for this ID to update the data:")  # use input() if python version >= 3
    df.loc[searchid] = [str(name)]
If ID is not the index:
if str(searchid) in df.ID.values.astype(str):
    print("ID found")
else:
    name = raw_input("ID not found. Specify the name for this ID to update the data:")  # use input() if python version >= 3
    df.loc[searchid] = [str(searchid), str(name)]
Specifying column headers during the update might avoid mismatch errors:
df.loc[searchid] = {'ID': str(searchid), 'Name': str(name)}
This should help.
Also read https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html, which mentions the inherent nature of append and concat to copy the full dataframe.
df.loc['ID'] will return the row with label 'ID' in the index of the dataframe, assuming the IDs are the index values of the df you are referring to.
If you have a list of IDs and wish to search for them all together, then assuming:
listofids = ['ID1', 'ID2', 'ID3']
df.loc[listofids]
will yield the rows containing the above IDs.
If the IDs are not in the index, then assuming df['ids'] contains the given ID list:
'searchedID' in df.ids.values
will return True or False based on presence or absence.
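Note that DataFrame.append, used in the answer above, was removed in pandas 2.0; a vectorized membership test plus pd.concat covers the same check-then-add flow. A sketch using the question's data (the 'Anna' row is made up for illustration):

```python
import pandas as pd

# Data matching the question's file
df = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Mark', 'John', 'Mary']})

def add_if_missing(df, user_id, name):
    # Vectorized membership test replaces the manual loop
    if user_id in df['ID'].values:
        print("the value already exist.")
        return df
    # DataFrame.append was removed in pandas 2.0; concat is the replacement
    new_row = pd.DataFrame([{'ID': user_id, 'Name': name}])
    return pd.concat([df, new_row], ignore_index=True)

df = add_if_missing(df, 4, 'Anna')
print(df)
```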

Create a dataframe from one dictionary and remove a specific character

I would like to know if it is possible to create a dataframe from a dictionary.
I get a dictionary like this:
dict = {'MO': ['N-2', 'N-8', 'N-7', 'N-6', 'N-9'], 'MO2': ['N0-6'], 'MO3': ['N-2']}
My result will be like this :
ID NUM
0 MO 'N-2', 'N-8', 'N-7', 'N-6', 'N-9'
1 MO2 'N0-6'
2 MO3 'N-2'
I tried to obtain this result, but in the value column I get lists wrapped in [] and I can't remove them:
liste_id = list(dict.keys())
liste_num = list(dict.values())
df = pandas.DataFrame({'ID': liste_id, 'NUM': liste_num})
Merge the values in the dictionary into a single string before creating the dataframe; this ensures each row holds one value of the same type:
pd.DataFrame([(key, ", ".join(value)) for key, value in dicts.items()],
             columns=['ID', 'NUM'])
ID NUM
0 MO N-2, N-8, N-7, N-6, N-9
1 MO2 N0-6
2 MO3 N-2
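If you prefer to keep the asker's two-list construction, Series.str.join can do the same cleanup after the fact. A sketch (the dictionary is renamed so it doesn't shadow the built-in dict):

```python
import pandas as pd

# The question's dictionary, under a non-shadowing name
data = {'MO': ['N-2', 'N-8', 'N-7', 'N-6', 'N-9'], 'MO2': ['N0-6'], 'MO3': ['N-2']}

# Build the frame with list values first,
# then join each list in place with Series.str.join
df = pd.DataFrame({'ID': list(data.keys()), 'NUM': list(data.values())})
df['NUM'] = df['NUM'].str.join(', ')
print(df.loc[0, 'NUM'])  # N-2, N-8, N-7, N-6, N-9
```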

Reorder row values csv pandas

I have a csv file
1 , name , 1012B-Amazon , 2044C-Flipcart , Bosh27-Walmart
2 , name , Kelvi20-Flipcart, LG-Walmart
3, name , Kenstar-Walmart, Sony-Amazon , Kenstar-Flipcart
4, name , LG18-Walmart, Bravia-Amazon
I need the rows to be rearranged by the websites, i.e. the part after the '-':
1, name , 1012B-Amazon , 2044C-Flipcart , Bosh27-Walmart
2, name , , Kelv20-Flipcart, LG-Walmart
3, name , Sony-Amazon, Kenstar-Flipcart ,Kenstar-Walmart
4, name , Bravia-Amazon, ,LG18-Walmart
Is it possible using pandas? Finding the existence of a string, rearranging it, iterating through all rows, and repeating this for the next string? I went through the documentation of Series.str.contains and str.extract but was unable to find a solution.
Using sorted with key:
df.iloc[:, 1:].apply(lambda x: sorted(x, key=lambda y: (y == '', y)), 1)
2 3 4 5
1 ABC DEF GHI JKL
2 ABC DEF GHI
3 ABC DEF GHI JKL
#df.iloc[:,1:]=df.iloc[:,1:].apply(lambda x : sorted(x,key=lambda y: (y=='',y)),1)
Since you mention reindex, I think get_dummies will work:
s = pd.get_dummies(df.iloc[:, 1:], prefix='', prefix_sep='')
s = s.drop('', 1)
df.iloc[:, 1:] = s.mul(s.columns).values
df
df
1 2 3 4 5
1 name ABC DEF GHI JKL
2 name ABC DEF GHI
3 name ABC DEF GHI JKL
Assuming the empty value is np.nan:
# Fill in the empty values with some string to allow sorting
df.fillna('NaN', inplace=True)
# Flatten the dataframe, do the sorting and reshape back to a dataframe
pd.DataFrame(list(map(sorted, df.values)))
0 1 2 3
0 ABC DEF GHI JKL
1 ABC DEF GHI NaN
2 ABC DEF GHI JKL
UPDATE
Given the update to the question and the sample data being as follows
df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
                   'b': ['1012B-Amazon', 'Kelvi20-Flipcart', 'Kenstar-Walmart', 'LG18-Walmart'],
                   'c': ['2044C-Flipcart', 'LG-Walmart', 'Sony-Amazon', 'Bravia-Amazon'],
                   'd': ['Bosh27-Walmart', np.nan, 'Kenstar-Flipcart', np.nan]})
a possible solution could be
def foo(df, retailer):
    # Find cells that contain the name of the retailer
    mask = df.where(df.apply(lambda x: x.str.contains(retailer)), '')
    # Squash the resulting mask into a series
    col = mask.max(skipna=True, axis=1)
    # Optional: trim the name of the retailer
    col = col.str.replace(f'-{retailer}', '')
    return col

df_out = pd.DataFrame(df['name'])
for retailer in ['Amazon', 'Walmart', 'Flipcart']:
    df_out[retailer] = foo(df, retailer)
resulting in
name Amazon Walmart Flipcart
0 name1 1012B Bosh27 2044C
1 name2 LG Kelvi20
2 name3 Sony Kenstar Kenstar
3 name4 Bravia LG18
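A related sketch that sorts each row's values by the retailer suffix after the '-'. Unlike the foo approach, it packs values to the left rather than leaving blanks for missing retailers, which matches the "sorted" part of the question but not the column alignment; the column names b/c/d are made up:

```python
import pandas as pd
import numpy as np

# One sample row from the question, with hypothetical column names
df = pd.DataFrame({'name': ['name3'],
                   'b': ['Kenstar-Walmart'],
                   'c': ['Sony-Amazon'],
                   'd': ['Kenstar-Flipcart']})

def sort_by_site(row):
    # Sort each row's non-empty values by the retailer suffix after '-'
    vals = [v for v in row if pd.notnull(v)]
    vals.sort(key=lambda s: s.split('-')[-1])
    return pd.Series(vals + [np.nan] * (len(row) - len(vals)), index=row.index)

df.iloc[:, 1:] = df.iloc[:, 1:].apply(sort_by_site, axis=1)
print(df.iloc[0].tolist())  # ['name3', 'Sony-Amazon', 'Kenstar-Flipcart', 'Kenstar-Walmart']
```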
Edit after Question Update:
This is the abc csv:
1,name,ABC,GHI,DEF,JKL
2,name,GHI,DEF,ABC,
3,name,JKL,GHI,ABC,DEF
This is the company csv (it is necessary to watch the commas carefully):
1,name,1012B-Amazon,2044C-Flipcart,Bosh27-Walmart
2,name,Kelvi20-Flipcart,LG-Walmart,
3,name,Kenstar-Walmart,Sony-Amazon,Kenstar-Flipcart
4,name,LG18-Walmart,Bravia-Amazon,
Here is the code
import pandas as pd
import numpy as np
#These solutions assume that each non-empty value is not repeated
#within a row. If that is not the case for your data, some transformation
#could make the non-empty values unique within each row.
#"get_company" returns the company if the value is non-empty and an
#empty value if the value was empty to begin with:
def get_company(company_item):
    if pd.isnull(company_item):
        return np.nan
    else:
        company = company_item.split('-')[-1]
        return company
#Using the "define_sort_order" function, one can retrieve a template used later
#to sort all rows in the "reorder_data_rows" function. The template is derived
#from all non-empty values in the matrix when by_largest_row = False.
#One could also choose the single largest row to serve as the
#template for all other rows. Both options work similarly when
#all rows are subsets of the largest row, i.e. every element in every
#other row (subset) can be found in the largest row (or set).
#The difference matters when the rows contain unique elements:
#whether one wants a table with all sorted elements serving
#as the columns, or whether one wants to simply exclude elements
#that are not in the largest row.
#Rather than only returning the original data rows,
#one can get back a template with different values from those of the
#original dataset by passing a function to operate on the values.
def define_sort_order(data, by_largest_row=False, value_filtering_function=None):
    if not by_largest_row:
        if value_filtering_function:
            data = data.applymap(value_filtering_function)
        #data.values returns a numpy array with rows and columns;
        #.flatten() puts all elements in a 1-dim array;
        #set gets all unique values in the array
        filtered_values = list(set(data.values.flatten()))
        filtered_values = [data_value for data_value in filtered_values if not_empty(data_value)]
        #sorted returns a list, even with np.arrays as inputs
        model_row = sorted(filtered_values)
    else:
        if value_filtering_function:
            data = data.applymap(value_filtering_function)
        row_lengths = data.apply(lambda data_row: data_row.notnull().sum(), axis=1)
        #locates the numerical index for the row with the most non-empty elements:
        model_row_idx = row_lengths.idxmax()
        #sort and filter the row with the most values:
        filtered_values = list(set(data.iloc[model_row_idx]))
        model_row = [data_value for data_value in sorted(filtered_values) if not_empty(data_value)]
    return model_row
#"not_empty" is used in the above function to filter the model list so that
#no empty elements remain
def not_empty(value):
    return pd.notnull(value) and value not in ['', ' ', None]
#Sorts all elements in each data_row into their corresponding positions within the model_row.
#Elements in the model_row that are missing from the current data_row are replaced with np.nan
def reorder_data_rows(data_row, model_row, check_by_function=None):
    #Apply the same function that was used when finding the ordering of the model_row:
    #each value in the data row is transformed temporarily to determine whether the
    #transformed value is in the model row. If so, we determine where, and order
    #the row accordingly.
    if check_by_function:
        #create an empty vector that is the same length as the template (model_row)
        sorted_data_row = [np.nan] * len(model_row)
        data_row = [value for value in data_row.values if not_empty(value)]
        for value in data_row:
            value_lookup = check_by_function(value)
            if value_lookup in model_row:
                idx = model_row.index(value_lookup)
                #place company items in their respective row positions
                #as indicated by the model_row
                sorted_data_row[idx] = value
    else:
        sorted_data_row = [value if value in data_row.values else np.nan for value in model_row]
    return pd.Series(sorted_data_row)
##################### ABC ######################
#Reading the data:
#the file would automatically include the header as the first row if the
#header=None option were not included. Note: "name" and the 1,2,3 column are not in the index.
abc = pd.read_csv("abc.csv", header=None, index_col=None)
#Returns a sorted, non-empty list. If you hard-code the order you want,
#you can pass that order directly as model_row and avoid
#all functions aside from reorder_data_rows.
model_row = define_sort_order(abc.iloc[:, 2:], False)
#apply the "reorder_data_rows" function to each row before saving back into
#the original dataframe.
#lambda lets us create a function without naming it; it is useful here
#in order to pass two inputs to reorder_data_rows
abc.iloc[:, 2:] = abc.iloc[:, 2:].apply(lambda abc_row: reorder_data_rows(abc_row, model_row), axis=1).values
#Saving to a new csv that won't include the pandas-created indices (0,1,2)
#or column names (0,1,2,3,4):
abc.to_csv("sorted_abc.csv", header=False, index=False)
################################################
################## COMPANY #####################
company = pd.read_csv("company.csv", header=None, index_col=None)
model_row = define_sort_order(company.iloc[:, 2:], by_largest_row=False, value_filtering_function=get_company)
#the only thing that changes here is that we tell the sort function what specific
#criteria to use to reorder each row. We're using the result from the
#get_company function to do so. The custom function get_company takes an input
#such as Kenstar-Walmart and outputs Walmart (what's after the "-");
#we then sort by the resulting list of companies.
#Because we used the define_sort_order function to retrieve companies rather than company items,
#we need to use the same function to reorder each element in the DataFrame
company.iloc[:, 2:] = company.iloc[:, 2:].apply(lambda companies_row: reorder_data_rows(companies_row, model_row, check_by_function=get_company), axis=1).values
company.to_csv("sorted_company.csv", header=False, index=False)
#################################################
Here is the first result from sorted_abc.csv:
1 name ABC DEF GHI JKL
2 name ABC DEF GHI NaN
3 name ABC DEF GHI JKL
After modifying the code to the form asked about, here is the sorted_company.csv that resulted from running the script:
1 name 1012B-Amazon 2044C-Flipcart Bosh27-Walmart
2 name NaN Kelvi20-Flipcart LG-Walmart
3 name Sony-Amazon Kenstar-Flipcart Kenstar-Walmart
4 name Bravia-Amazon NaN LG18-Walmart
I hope it helps!
