I would like to know if it is possible to create a dataframe from two dictionaries.
I get two dictionaries like this:
dict= {'MO': ['N-2', 'N-8', 'N-7', 'N-6', 'N-9'], 'MO2': ['N0-6'], 'MO3': ['N-2']}
My desired result would look like this:
ID NUM
0 MO 'N-2', 'N-8', 'N-7', 'N-6', 'N-9'
1 MO2 'N0-6'
2 MO3 'N-2'
I tried to obtain this result, but in the value column I get brackets ([]) around the values and I can't remove them:
liste_id=list(dict.keys())
liste_num=list(dict.values())
df = pandas.DataFrame({'ID':liste_id,'NUM':liste_num})
Join the values of each dictionary entry into a single string before creating the dataframe; each cell then holds a string instead of a list, so the brackets disappear.
pd.DataFrame([(key, ", ".join(value))
              for key, value in dicts.items()],
             columns=['ID', 'NUM'])
ID NUM
0 MO N-2, N-8, N-7, N-6, N-9
1 MO2 N0-6
2 MO3 N-2
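For reference, a minimal self-contained version of the same approach, using the sample data from the question (the dictionary is renamed to d here so it does not shadow the built-in dict):
import pandas as pd

d = {'MO': ['N-2', 'N-8', 'N-7', 'N-6', 'N-9'], 'MO2': ['N0-6'], 'MO3': ['N-2']}

# Join each list of values into one string, then build the dataframe
df = pd.DataFrame([(key, ", ".join(value)) for key, value in d.items()],
                  columns=['ID', 'NUM'])
print(df)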
I have a csv file
1 , name , 1012B-Amazon , 2044C-Flipcart , Bosh27-Walmart
2 , name , Kelvi20-Flipcart, LG-Walmart
3, name , Kenstar-Walmart, Sony-Amazon , Kenstar-Flipcart
4, name , LG18-Walmart, Bravia-Amazon
I need the values in each row to be rearranged by website, i.e. by the part after the '-':
1, name , 1012B-Amazon , 2044C-Flipcart , Bosh27-Walmart
2, name , , Kelvi20-Flipcart, LG-Walmart
3, name , Sony-Amazon, Kenstar-Flipcart ,Kenstar-Walmart
4, name , Bravia-Amazon, ,LG18-Walmart
Is it possible using pandas? Something like finding the existence of a string, rearranging it, iterating through all rows, and repeating this for the next string? I went through the documentation of Series.str.contains and str.extract but was unable to find a solution.
Using sorted with a key:
df.iloc[:, 1:].apply(lambda x: sorted(x, key=lambda y: (y == '', y)), axis=1)
2 3 4 5
1 ABC DEF GHI JKL
2 ABC DEF GHI
3 ABC DEF GHI JKL
#df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: sorted(x, key=lambda y: (y == '', y)), axis=1)
Since you mention reindex, I think get_dummies will work:
s = pd.get_dummies(df.iloc[:, 1:], prefix='', prefix_sep='')
s = s.drop('', axis=1)
df.iloc[:, 1:] = s.mul(s.columns).values
df
1 2 3 4 5
1 name ABC DEF GHI JKL
2 name ABC DEF GHI
3 name ABC DEF GHI JKL
Assuming the empty value is np.nan:
# Fill in the empty values with some string to allow sorting
df.fillna('NaN', inplace=True)
# Flatten the dataframe, do the sorting and reshape back to a dataframe
pd.DataFrame(list(map(sorted, df.values)))
0 1 2 3
0 ABC DEF GHI JKL
1 ABC DEF GHI NaN
2 ABC DEF GHI JKL
UPDATE
Given the update to the question and the sample data being as follows
df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
                   'b': ['1012B-Amazon', 'Kelvi20-Flipcart', 'Kenstar-Walmart', 'LG18-Walmart'],
                   'c': ['2044C-Flipcart', 'LG-Walmart', 'Sony-Amazon', 'Bravia-Amazon'],
                   'd': ['Bosh27-Walmart', np.nan, 'Kenstar-Flipcart', np.nan]})
a possible solution could be
def foo(df, retailer):
    # Find cells that contain the name of the retailer
    # (na=False so empty/NaN cells count as non-matches)
    mask = df.where(df.apply(lambda x: x.str.contains(retailer, na=False)), '')
    # Squash the resulting mask into a series
    col = mask.max(skipna=True, axis=1)
    # Optional: trim the name of the retailer
    col = col.str.replace(f'-{retailer}', '')
    return col

df_out = pd.DataFrame(df['name'])
for retailer in ['Amazon', 'Walmart', 'Flipcart']:
    df_out[retailer] = foo(df, retailer)
resulting in
name Amazon Walmart Flipcart
0 name1 1012B Bosh27 2044C
1 name2 LG Kelvi20
2 name3 Sony Kenstar Kenstar
3 name4 Bravia LG18
Edit after Question Update:
This is the abc csv:
1,name,ABC,GHI,DEF,JKL
2,name,GHI,DEF,ABC,
3,name,JKL,GHI,ABC,DEF
This is the company csv (note the trailing commas):
1,name,1012B-Amazon,2044C-Flipcart,Bosh27-Walmart
2,name,Kelvi20-Flipcart,LG-Walmart,
3,name,Kenstar-Walmart,Sony-Amazon,Kenstar-Flipcart
4,name,LG18-Walmart,Bravia-Amazon,
Here is the code
import pandas as pd
import numpy as np

#These solutions assume that each non-empty value is not repeated
#within a row. If that is not the case for your data, you could first apply
#a transformation so that the non-empty values are unique within each row.

#"get_company" returns the company if the value is non-empty and an
#empty value if the value was empty to begin with:
def get_company(company_item):
    if pd.isnull(company_item):
        return np.nan
    else:
        company = company_item.split('-')[-1]
        return company
#Using the "define_sort_order" function, one can retrieve a template to later
#sort all rows in the sort_abc_rows function. The template is derived from all
#values, aside from empty values, within the matrix when "by_largest_row" = False.
#One could also choose the single largest row to serve as the
#template for all other rows to follow. Both options work similarly when
#all rows are subsets of the largest row i.e. Every element in every
#other row (subset) can be found in the largest row (or set)
#The difference relates to, when the items contain unique elements,
#Whether one wants to create a table with all sorted elements serving
#as the columns, or whether one wants to simply exclude elements
#that are not in the largest row when at least one non-subset row does not exist
#Rather than only having the application of returning the original data rows,
#one can get back a novel template with different values from that of the
#original dataset if one uses a function to operate on the template
def define_sort_order(data, by_largest_row=False, value_filtering_function=None):
    if not by_largest_row:
        if value_filtering_function:
            data = data.applymap(value_filtering_function)
        #data.values returns a numpy array with rows and columns;
        #.flatten() puts all elements in a 1-dim array and
        #set() keeps only the unique values in the array
        filtered_values = list(set(data.values.flatten()))
        filtered_values = [data_value for data_value in filtered_values if not_empty(data_value)]
        #sorted returns a list, even with np.arrays as inputs
        model_row = sorted(filtered_values)
    else:
        if value_filtering_function:
            data = data.applymap(value_filtering_function)
        row_lengths = data.apply(lambda data_row: data_row.notnull().sum(), axis=1)
        #locate the numerical index of the row with the most non-empty elements:
        model_row_idx = row_lengths.idxmax()
        #sort and filter the row with the most values:
        filtered_values = [data_value for data_value in set(data.iloc[model_row_idx]) if not_empty(data_value)]
        model_row = sorted(filtered_values)
    return model_row
#"not_empty" is used in the above function in order to filter list models that
#they no empty elements remain
def not_empty(value):
return pd.notnull(value) and value not in ['',' ',None]
#Sorts all elements in each data_row into their corresponding positions within the model row.
#Elements of the model row that are missing from the current data_row are replaced with np.nan.
def reorder_data_rows(data_row, model_row, check_by_function=None):
    #Here, we apply the same function that was used when the ordering of the
    #model_row was originally computed. Each value of the data row is transformed
    #temporarily to determine whether (and where) the transformed value appears in
    #the model row; the original value is then placed at that position.
    if check_by_function:
        sorted_data_row = [np.nan]*len(model_row)  #an empty vector with the same
                                                   #length as the template (model_row)
        data_row = [value for value in data_row.values if not_empty(value)]
        for value in data_row:
            value_lookup = check_by_function(value)
            if value_lookup in model_row:
                idx = model_row.index(value_lookup)
                #place the item in the row position indicated by the model_row
                sorted_data_row[idx] = value
    else:
        sorted_data_row = [value if value in data_row.values else np.nan for value in model_row]
    return pd.Series(sorted_data_row)
##################### ABC ######################
#Reading the data:
#the file would automatically use the first row as the header if the
#header = None option were not included. Note: "name" and the 1,2,3 columns are not in the index.
abc = pd.read_csv("abc.csv", header=None, index_col=None)
#Returns a sorted, non-empty list. If you hard-code the order you want,
#you can simply pass that hard-coded order as the model_row argument and avoid
#all functions aside from reorder_data_rows.
model_row = define_sort_order(abc.iloc[:, 2:], False)
#apply the "reorder_data_rows" function we created earlier to each row before saving back into
#the original dataframe.
#lambda allows us to create our own function without giving it a name;
#it is useful here because it lets us pass two inputs to reorder_data_rows.
abc.iloc[:, 2:] = abc.iloc[:, 2:].apply(lambda abc_row: reorder_data_rows(abc_row, model_row), axis=1).values
#Saving to a new csv that won't include the pandas-created indices (0,1,2)
#or column names (0,1,2,3,4):
abc.to_csv("sorted_abc.csv", header=False, index=False)
################################################
################## COMPANY #####################
company = pd.read_csv("company.csv", header=None, index_col=None)
model_row = define_sort_order(company.iloc[:, 2:], by_largest_row=False, value_filtering_function=get_company)
#The only thing that changes here is that we tell the sort function what specific
#criterion to use to reorder each row: the result of the get_company function.
#The custom function get_company takes an input such as Kenstar-Walmart and
#outputs Walmart (what's after the "-"); we then sort by the resulting list of companies.
#Because we used the define_sort_order function to retrieve companies rather than company items,
#we need to pass the same function when reordering each element of the DataFrame.
company.iloc[:, 2:] = company.iloc[:, 2:].apply(lambda companies_row: reorder_data_rows(companies_row, model_row, check_by_function=get_company), axis=1).values
company.to_csv("sorted_company.csv", header=False, index=False)
#################################################
Here is the first result from sorted_abc.csv:
1 name ABC DEF GHI JKL
2 name ABC DEF GHI NaN
3 name ABC DEF GHI JKL
After modifying the code in the way described above, here is the sorted_company.csv produced by running the script:
1 name 1012B-Amazon 2044C-Flipcart Bosh27-Walmart
2 name NaN Kelvi20-Flipcart LG-Walmart
3 name Sony-Amazon Kenstar-Flipcart Kenstar-Walmart
4 name Bravia-Amazon NaN LG18-Walmart
I hope it helps!
I have a made-up pandas series that I split on a delimiter:
s2 = pd.Series(['2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*'])
split = s2.str.split('*')
The general logic to parse this string:
Asterisks are the delimiter
Numbers immediately before asterisks identify the length of the following block
Three indicators
C indicates field names will follow
N indicates new field values will follow
O indicates old field values will follow
Numbers immediately after indicators (tricky because they sit right next to the numbers that precede asterisks) identify how many field names or values will follow
The parsing logic and code below work on a single pandas series, so understanding them in detail is less important than understanding how to apply them to a dataframe.
I calculate the number of fields in the string (in this case, the 3 in the second block which is C316):
number_of_fields = int(split[0][1][1:int(split[0][0])])
I apply a lot of list splitting to extract the results I need into three separate lists (field names, new values, and old values):
i = 2
string_length = int(split[0][1][int(split[0][0]):])
field_names_list = []
while i < number_of_fields + 2:
    field_name = split[0][i][0:string_length]
    field_names_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 3 + number_of_fields
string_length = int(split[0][2 + number_of_fields][string_length:])
new_values_list = []
while i < 3 + number_of_fields*2:
    field_name = split[0][i][0:string_length]
    new_values_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 4 + number_of_fields*2
string_length = int(split[0][3 + number_of_fields*2][string_length:])
old_values_list = []
while i <= 3 + number_of_fields*3:
    old_value = split[0][i][0:string_length]
    old_values_list.append(old_value)
    if i == 3 + number_of_fields*3:
        string_length = 0
    else:
        string_length = int(split[0][i][string_length:])
    i += 1
I combine the lists into a df with three columns:
df = pd.DataFrame(
{'field_name': field_names_list,
'new_value': new_values_list,
'old_value': old_values_list
})
field_name new_value old_value
0 first_field_name field value
1 second_field_name Y
2 third_field_name hello
How would I apply this same process to a df with multiple strings? The df would look like this:
row_id string
0 24 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
1 25 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
I'm unsure how to maintain the row_id with the eventual columns. The end result should look like this:
row_id field_name new_value old_value
0 24 first_field_name field value
1 24 second_field_name Y
2 24 third_field_name hello
3 25 first_field_name field value
4 25 second_field_name Y
5 25 third_field_name hello
I know I can concatenate multiple dataframes, but that would come after maintaining the row_id. How do I keep the row_id with the corresponding values after a series of list slicing operations?
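One possible way to keep the row_id (a sketch, not the only option): wrap the parsing logic above, essentially unchanged, in a helper function that takes one raw string and returns the three-column dataframe, then build one such dataframe per row, insert that row's row_id as a column, and concatenate the pieces. The helper name parse_string is made up here.
import pandas as pd

def parse_string(raw):
    # Same parsing logic as above, applied to a single raw string
    fields = raw.split('*')
    number_of_fields = int(fields[1][1:int(fields[0])])

    field_names_list = []
    string_length = int(fields[1][int(fields[0]):])
    i = 2
    while i < number_of_fields + 2:
        field_names_list.append(fields[i][0:string_length])
        string_length = int(fields[i][string_length:])
        i += 1

    new_values_list = []
    string_length = int(fields[2 + number_of_fields][string_length:])
    i = 3 + number_of_fields
    while i < 3 + number_of_fields*2:
        new_values_list.append(fields[i][0:string_length])
        string_length = int(fields[i][string_length:])
        i += 1

    old_values_list = []
    string_length = int(fields[3 + number_of_fields*2][string_length:])
    i = 4 + number_of_fields*2
    while i <= 3 + number_of_fields*3:
        old_values_list.append(fields[i][0:string_length])
        if i != 3 + number_of_fields*3:
            string_length = int(fields[i][string_length:])
        i += 1

    return pd.DataFrame({'field_name': field_names_list,
                         'new_value': new_values_list,
                         'old_value': old_values_list})

# df is assumed to be the two-column dataframe shown above (row_id, string)
pieces = []
for row_id, raw in zip(df['row_id'], df['string']):
    parsed = parse_string(raw)
    parsed.insert(0, 'row_id', row_id)  # carry the row_id alongside the parsed fields
    pieces.append(parsed)

result = pd.concat(pieces, ignore_index=True)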
I have a simple data frame with IDs and date, like below:
ID Date
a 2009/12/1
c 2009/12/1
d 2009/12/1
a 2010/4/1
c 2010/5/1
e 2010/5/1
b 2010/12/1
b 2012/3/1
e 2012/7/1
b 2013/1/1
...
...
I need to count unique values for each month and accumulate them, without counting IDs that have already appeared. For instance:
2009/12/1 3
2010/4/1 3
2010/5/1 4
... ...
I created a loop, but it is not working:
for d in df['date'].drop_duplicates():
    c = df[df['date'] <= d].ID.nunique()
    df2 = DataFrame(data=c, index=d)
Can anyone tell me where the problem is? Thanks.
You should be using groupby() rather than looping over your data frame. After grouping by the date column, you can count the unique instances of ID using:
df.groupby('Date')['ID'].nunique()
Quick example:
df = pd.DataFrame([['a', '2009/12/1'],
                   ['c', '2009/12/1'],
                   ['d', '2009/12/1'],
                   ['c', '2009/12/1'],
                   ['a', '2010/4/1'],
                   ['c', '2010/5/1'],
                   ['e', '2010/5/1']], columns=['ID', 'Date'])
df.groupby('Date')['ID'].nunique()
# returns:
# Date
# 2009/12/1 3
# 2010/4/1 1
# 2010/5/1 2
One option is to write a for loop and use a set to hold the cumulative unique IDs:
cumcount = []
cumunique = set()
date = []
for k, g in df.groupby(pd.to_datetime(df.Date)):
    cumunique |= set(g.ID)           # hold cumulative unique IDs
    date.append(g.Date.iat[0])       # get the date variable for each group
    cumcount.append(len(cumunique))  # hold cumulative count of unique IDs
pd.DataFrame({"Date": date, "ID": cumcount})
I have a pandas dataframe where all missing values are np.nan, and I am trying to replace these missing values. The last column of my data is "class". I need to group the data by class, then for each column of the group compute the mean/median/mode (depending on whether the data is categorical or continuous, normal or not) and replace the missing values of that group of the column with the respective mean/median/mode.
This is the code I have come up with, which I know is overkill.
If I could:
group the columns of the dataframe
get the median/mode/mean of each column within the groups
replace the missing values of those groups
recombine them back into the original df
it would be great.
But currently I have ended up finding the replacement values (mean/median/mode) group-wise and storing them in a dict, then separating the rows with NaN from the rows without NaN, replacing the missing values in the NaN rows, and trying to join them back into the dataframe (which I don't yet know how to do).
def fillMissing(df, dataType):
    '''
    Args:
        df (2d array / dict):
            eg: ('attribute1': [12, 24, 25], 'attribute2': ['good', 'bad'])
        dataType (dict): dictionary with attribute names of df as keys and values 0/1
            indicating categorical/continuous variables, eg: ('attribute1': 1, 'attribute2': 0)
    Returns:
        dataframe with missing values filled;
        writes a file with missing values replaced.
    '''
    dataLabels = list(df.columns.values)
    # the dictionary to hold the values to put in place of nan
    replaceValues = {}
    for eachlabel in dataLabels:
        thisSer = df[eachlabel]
        if dataType[eachlabel] == 1:  # if it is a continuous variable
            _, pval = stats.normaltest(thisSer)
            groupedd = thisSer.groupby(df['class'])
            innerDict = {}
            for name, group in groupedd:
                if pval < 0.5:
                    groupMiddle = group.median()  # get the median of the group
                else:
                    groupMiddle = group.mean()  # get the mean (if the group is normal)
                innerDict[name.strip()] = groupMiddle
            replaceValues[eachlabel] = innerDict
        else:  # if the variable is categorical
            # freqCount = collections.Counter(thisSer)
            groupedd = thisSer.groupby(df['class'])
            innerDict = {}
            for name, group in groupedd:
                freqC = collections.Counter(group)
                mostFreq = freqC.most_common(1)  # most frequent value of the attribute (grouped by class)
                # newGroup = group.replace(np.nan, mostFreq)
                innerDict[name.strip()] = mostFreq[0][0].strip()
            replaceValues[eachlabel] = innerDict
    print(replaceValues)

    # replace the missing values =======================
    newfile = open('missingReplaced.csv', 'w')
    newdf = df
    mask = False
    for col in df.columns:
        mask = mask | df[col].isnull()
    # get the dataframe of rows that contain nulls
    dfnulls = df[mask]
    dfnotNulls = df[~mask]
    for _, row in dfnulls.iterrows():
        for colname in dataLabels:
            if pd.isnull(row[colname]):
                if row['class'].strip() == '>50K':
                    row[colname] = replaceValues[colname]['>50K']
                else:
                    row[colname] = replaceValues[colname]['<=50K']
            newfile.write(str(row[colname]) + ",")
        newdf.append(row)
        newfile.write("\n")
    # here add newdf to dfnotNulls to get finaldf
    return finaldf
If I understand correctly, this is mostly in the documentation, but probably not where you'd be looking if you're asking the question. See note regarding mode at the bottom as it is slightly trickier than mean and median.
df = pd.DataFrame({ 'v':[1,2,2,np.nan,3,4,4,np.nan] }, index=[1,1,1,1,2,2,2,2],)
df['v_mean'] = df.groupby(level=0)['v'].transform( lambda x: x.fillna(x.mean()))
df['v_med' ] = df.groupby(level=0)['v'].transform( lambda x: x.fillna(x.median()))
df['v_mode'] = df.groupby(level=0)['v'].transform( lambda x: x.fillna(x.mode()[0]))
df
v v_mean v_med v_mode
1 1 1.000000 1 1
1 2 2.000000 2 2
1 2 2.000000 2 2
1 NaN 1.666667 2 2
2 3 3.000000 3 3
2 4 4.000000 4 4
2 4 4.000000 4 4
2 NaN 3.666667 4 4
Note that mode() may not be unique, unlike mean and median and pandas returns it as a Series for that reason. To deal with that, I just took the simplest route and added [0] in order to extract the first member of the series.
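A sketch of how the same transform/fillna idea could be adapted to the asker's setup (assuming a 'class' column and a dataType dict mapping each column name to 1 for continuous and 0 for categorical, as in the question; fill_by_class and the sample data are made up here):
import numpy as np
import pandas as pd

def fill_by_class(df, dataType):
    out = df.copy()
    for col in out.columns.drop('class'):
        grouped = out.groupby('class')[col]
        if dataType[col] == 1:
            # continuous: fill with the group mean (swap in median where appropriate)
            out[col] = grouped.transform(lambda x: x.fillna(x.mean()))
        else:
            # categorical: fill with the group mode, taking the first value if there are ties
            out[col] = grouped.transform(lambda x: x.fillna(x.mode()[0]))
    return out

# Example usage with made-up data
sample = pd.DataFrame({'age': [25, np.nan, 40, 38],
                       'job': ['a', 'b', np.nan, 'b'],
                       'class': ['<=50K', '<=50K', '>50K', '>50K']})
filled = fill_by_class(sample, {'age': 1, 'job': 0})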