I would like to know if it is possible to create a dataframe from two dictionaries.
I get two dictionaries like this:
dict= {'MO': ['N-2', 'N-8', 'N-7', 'N-6', 'N-9'], 'MO2': ['N0-6'], 'MO3': ['N-2']}
My desired result would look like this:
ID NUM
0 MO 'N-2', 'N-8', 'N-7', 'N-6', 'N-9'
1 MO2 'N0-6'
2 MO3 'N-2'
I tried to obtain this result, but in the value column I get brackets ([]) around the values and I can't remove them:
liste_id=list(dict.keys())
liste_num=list(dict.values())
df = pandas.DataFrame({'ID':liste_id,'NUM':liste_num})
Join the values of each dictionary entry into a single string before creating the dataframe; each cell then holds a string instead of a list, so the brackets disappear.
pd.DataFrame([(key, ", ".join(value))
              for key, value in dicts.items()],
             columns=['ID', 'NUM'])
ID NUM
0 MO N-2, N-8, N-7, N-6, N-9
1 MO2 N0-6
2 MO3 N-2
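For reference, a minimal self-contained version of the same approach, using the sample data from the question (the dictionary is renamed to d here so it does not shadow the built-in dict):
import pandas as pd

d = {'MO': ['N-2', 'N-8', 'N-7', 'N-6', 'N-9'], 'MO2': ['N0-6'], 'MO3': ['N-2']}

# Join each list of values into one string, then build the dataframe
df = pd.DataFrame([(key, ", ".join(value)) for key, value in d.items()],
                  columns=['ID', 'NUM'])
print(df)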
I have a csv file
1 , name , 1012B-Amazon , 2044C-Flipcart , Bosh27-Walmart
2 , name , Kelvi20-Flipcart, LG-Walmart
3, name , Kenstar-Walmart, Sony-Amazon , Kenstar-Flipcart
4, name , LG18-Walmart, Bravia-Amazon
I need the values in each row to be rearranged by website, i.e. by the part after the '-':
1, name , 1012B-Amazon , 2044C-Flipcart , Bosh27-Walmart
2, name , , Kelvi20-Flipcart, LG-Walmart
3, name , Sony-Amazon, Kenstar-Flipcart ,Kenstar-Walmart
4, name , Bravia-Amazon, ,LG18-Walmart
Is it possible using pandas? Something like finding the existence of a string, rearranging it, iterating through all rows, and repeating this for the next string? I went through the documentation of Series.str.contains and str.extract but was unable to find a solution.
Using sorted with a key:
df.iloc[:, 1:].apply(lambda x: sorted(x, key=lambda y: (y == '', y)), axis=1)
2 3 4 5
1 ABC DEF GHI JKL
2 ABC DEF GHI
3 ABC DEF GHI JKL
#df.iloc[:, 1:] = df.iloc[:, 1:].apply(lambda x: sorted(x, key=lambda y: (y == '', y)), axis=1)
Since you mention reindex, I think get_dummies will work:
s = pd.get_dummies(df.iloc[:, 1:], prefix='', prefix_sep='')
s = s.drop('', axis=1)
df.iloc[:, 1:] = s.mul(s.columns).values
df
1 2 3 4 5
1 name ABC DEF GHI JKL
2 name ABC DEF GHI
3 name ABC DEF GHI JKL
Assuming the empty value is np.nan:
# Fill in the empty values with some string to allow sorting
df.fillna('NaN', inplace=True)
# Flatten the dataframe, do the sorting and reshape back to a dataframe
pd.DataFrame(list(map(sorted, df.values)))
0 1 2 3
0 ABC DEF GHI JKL
1 ABC DEF GHI NaN
2 ABC DEF GHI JKL
UPDATE
Given the update to the question and the sample data being as follows
df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
                   'b': ['1012B-Amazon', 'Kelvi20-Flipcart', 'Kenstar-Walmart', 'LG18-Walmart'],
                   'c': ['2044C-Flipcart', 'LG-Walmart', 'Sony-Amazon', 'Bravia-Amazon'],
                   'd': ['Bosh27-Walmart', np.nan, 'Kenstar-Flipcart', np.nan]})
a possible solution could be
def foo(df, retailer):
    # Find cells that contain the name of the retailer
    # (na=False so empty/NaN cells count as non-matches)
    mask = df.where(df.apply(lambda x: x.str.contains(retailer, na=False)), '')
    # Squash the resulting mask into a series
    col = mask.max(skipna=True, axis=1)
    # Optional: trim the name of the retailer
    col = col.str.replace(f'-{retailer}', '')
    return col

df_out = pd.DataFrame(df['name'])
for retailer in ['Amazon', 'Walmart', 'Flipcart']:
    df_out[retailer] = foo(df, retailer)
resulting in
name Amazon Walmart Flipcart
0 name1 1012B Bosh27 2044C
1 name2 LG Kelvi20
2 name3 Sony Kenstar Kenstar
3 name4 Bravia LG18
Edit after Question Update:
This is the abc csv:
1,name,ABC,GHI,DEF,JKL
2,name,GHI,DEF,ABC,
3,name,JKL,GHI,ABC,DEF
This is the company csv (note the trailing commas):
1,name,1012B-Amazon,2044C-Flipcart,Bosh27-Walmart
2,name,Kelvi20-Flipcart,LG-Walmart,
3,name,Kenstar-Walmart,Sony-Amazon,Kenstar-Flipcart
4,name,LG18-Walmart,Bravia-Amazon,
Here is the code
import pandas as pd
import numpy as np

#These solutions assume that each non-empty value is not repeated
#within a row. If that is not the case for your data, you could first apply
#a transformation so that the non-empty values are unique within each row.

#"get_company" returns the company if the value is non-empty and an
#empty value if the value was empty to begin with:
def get_company(company_item):
    if pd.isnull(company_item):
        return np.nan
    else:
        company = company_item.split('-')[-1]
        return company
#Using the "define_sort_order" function, one can retrieve a template to later
#sort all rows in the sort_abc_rows function. The template is derived from all
#values, aside from empty values, within the matrix when "by_largest_row" = False.
#One could also choose the single largest row to serve as the
#template for all other rows to follow. Both options work similarly when
#all rows are subsets of the largest row i.e. Every element in every
#other row (subset) can be found in the largest row (or set)
#The difference relates to, when the items contain unique elements,
#Whether one wants to create a table with all sorted elements serving
#as the columns, or whether one wants to simply exclude elements
#that are not in the largest row when at least one non-subset row does not exist
#Rather than only having the application of returning the original data rows,
#one can get back a novel template with different values from that of the
#original dataset if one uses a function to operate on the template
def define_sort_order(data, by_largest_row=False, value_filtering_function=None):
    if not by_largest_row:
        if value_filtering_function:
            data = data.applymap(value_filtering_function)
        #data.values returns a numpy array with rows and columns;
        #.flatten() puts all elements in a 1-dim array and
        #set() keeps only the unique values in the array
        filtered_values = list(set(data.values.flatten()))
        filtered_values = [data_value for data_value in filtered_values if not_empty(data_value)]
        #sorted returns a list, even with np.arrays as inputs
        model_row = sorted(filtered_values)
    else:
        if value_filtering_function:
            data = data.applymap(value_filtering_function)
        row_lengths = data.apply(lambda data_row: data_row.notnull().sum(), axis=1)
        #locate the numerical index of the row with the most non-empty elements:
        model_row_idx = row_lengths.idxmax()
        #sort and filter the row with the most values:
        filtered_values = [data_value for data_value in set(data.iloc[model_row_idx]) if not_empty(data_value)]
        model_row = sorted(filtered_values)
    return model_row
#"not_empty" is used in the above function in order to filter list models that
#they no empty elements remain
def not_empty(value):
return pd.notnull(value) and value not in ['',' ',None]
#Sorts all elements in each data_row into their corresponding positions within the model row.
#Elements of the model row that are missing from the current data_row are replaced with np.nan.
def reorder_data_rows(data_row, model_row, check_by_function=None):
    #Here, we apply the same function that was used when the ordering of the
    #model_row was originally computed. Each value of the data row is transformed
    #temporarily to determine whether (and where) the transformed value appears in
    #the model row; the original value is then placed at that position.
    if check_by_function:
        sorted_data_row = [np.nan]*len(model_row)  #an empty vector with the same
                                                   #length as the template (model_row)
        data_row = [value for value in data_row.values if not_empty(value)]
        for value in data_row:
            value_lookup = check_by_function(value)
            if value_lookup in model_row:
                idx = model_row.index(value_lookup)
                #place the item in the row position indicated by the model_row
                sorted_data_row[idx] = value
    else:
        sorted_data_row = [value if value in data_row.values else np.nan for value in model_row]
    return pd.Series(sorted_data_row)
##################### ABC ######################
#Reading the data:
#the file would automatically use the first row as the header if the
#header = None option were not included. Note: "name" and the 1,2,3 columns are not in the index.
abc = pd.read_csv("abc.csv", header=None, index_col=None)
#Returns a sorted, non-empty list. If you hard-code the order you want,
#you can simply pass that hard-coded order as the model_row argument and avoid
#all functions aside from reorder_data_rows.
model_row = define_sort_order(abc.iloc[:, 2:], False)
#apply the "reorder_data_rows" function we created earlier to each row before saving back into
#the original dataframe.
#lambda allows us to create our own function without giving it a name;
#it is useful here because it lets us pass two inputs to reorder_data_rows.
abc.iloc[:, 2:] = abc.iloc[:, 2:].apply(lambda abc_row: reorder_data_rows(abc_row, model_row), axis=1).values
#Saving to a new csv that won't include the pandas-created indices (0,1,2)
#or column names (0,1,2,3,4):
abc.to_csv("sorted_abc.csv", header=False, index=False)
################################################
################## COMPANY #####################
company = pd.read_csv("company.csv", header=None, index_col=None)
model_row = define_sort_order(company.iloc[:, 2:], by_largest_row=False, value_filtering_function=get_company)
#The only thing that changes here is that we tell the sort function what specific
#criterion to use to reorder each row: the result of the get_company function.
#The custom function get_company takes an input such as Kenstar-Walmart and
#outputs Walmart (what's after the "-"); we then sort by the resulting list of companies.
#Because we used the define_sort_order function to retrieve companies rather than company items,
#we need to pass the same function when reordering each element of the DataFrame.
company.iloc[:, 2:] = company.iloc[:, 2:].apply(lambda companies_row: reorder_data_rows(companies_row, model_row, check_by_function=get_company), axis=1).values
company.to_csv("sorted_company.csv", header=False, index=False)
#################################################
Here is the first result from sorted_abc.csv:
1 name ABC DEF GHI JKL
2 name ABC DEF GHI NaN
3 name ABC DEF GHI JKL
After modifying the code in the way described above, here is the sorted_company.csv produced by running the script:
1 name 1012B-Amazon 2044C-Flipcart Bosh27-Walmart
2 name NaN Kelvi20-Flipcart LG-Walmart
3 name Sony-Amazon Kenstar-Flipcart Kenstar-Walmart
4 name Bravia-Amazon NaN LG18-Walmart
I hope it helps!
I have a made-up pandas series that I split on a delimiter:
s2 = pd.Series(['2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*'])
split = s2.str.split('*')
The general logic to parse this string:
Asterisks are the delimiter
Numbers immediately before asterisks identify the length of the following block
Three indicators
C indicates field names will follow
N indicates new field values will follow
O indicates old field values will follow
Numbers immediately after indicators (tricky because they sit right next to the numbers that precede asterisks) identify how many field names or values will follow
The parsing logic and code below work on a single pandas series, so understanding them in detail is less important than understanding how to apply them to a dataframe.
I calculate the number of fields in the string (in this case, the 3 in the second block which is C316):
number_of_fields = int(split[0][1][1:int(split[0][0])])
I apply a lot of list splitting to extract the results I need into three separate lists (field names, new values, and old values):
i = 2
string_length = int(split[0][1][int(split[0][0]):])
field_names_list = []
while i < number_of_fields + 2:
    field_name = split[0][i][0:string_length]
    field_names_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 3 + number_of_fields
string_length = int(split[0][2 + number_of_fields][string_length:])
new_values_list = []
while i < 3 + number_of_fields*2:
    field_name = split[0][i][0:string_length]
    new_values_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 4 + number_of_fields*2
string_length = int(split[0][3 + number_of_fields*2][string_length:])
old_values_list = []
while i <= 3 + number_of_fields*3:
    old_value = split[0][i][0:string_length]
    old_values_list.append(old_value)
    if i == 3 + number_of_fields*3:
        string_length = 0
    else:
        string_length = int(split[0][i][string_length:])
    i += 1
I combine the lists into a df with three columns:
df = pd.DataFrame(
{'field_name': field_names_list,
'new_value': new_values_list,
'old_value': old_values_list
})
field_name new_value old_value
0 first_field_name field value
1 second_field_name Y
2 third_field_name hello
How would I apply this same process to a df with multiple strings? The df would look like this:
row_id string
0 24 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
1 25 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
I'm unsure how to maintain the row_id with the eventual columns. The end result should look like this:
row_id field_name new_value old_value
0 24 first_field_name field value
1 24 second_field_name Y
2 24 third_field_name hello
3 25 first_field_name field value
4 25 second_field_name Y
5 25 third_field_name hello
I know I can concatenate multiple dataframes, but that would come after maintaining the row_id. How do I keep the row_id with the corresponding values after a series of list slicing operations?
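One possible way to keep the row_id (a sketch, not the only option): wrap the parsing logic above, essentially unchanged, in a helper function that takes one raw string and returns the three-column dataframe, then build one such dataframe per row, insert that row's row_id as a column, and concatenate the pieces. The helper name parse_string is made up here.
import pandas as pd

def parse_string(raw):
    # Same parsing logic as above, applied to a single raw string
    fields = raw.split('*')
    number_of_fields = int(fields[1][1:int(fields[0])])

    field_names_list = []
    string_length = int(fields[1][int(fields[0]):])
    i = 2
    while i < number_of_fields + 2:
        field_names_list.append(fields[i][0:string_length])
        string_length = int(fields[i][string_length:])
        i += 1

    new_values_list = []
    string_length = int(fields[2 + number_of_fields][string_length:])
    i = 3 + number_of_fields
    while i < 3 + number_of_fields*2:
        new_values_list.append(fields[i][0:string_length])
        string_length = int(fields[i][string_length:])
        i += 1

    old_values_list = []
    string_length = int(fields[3 + number_of_fields*2][string_length:])
    i = 4 + number_of_fields*2
    while i <= 3 + number_of_fields*3:
        old_values_list.append(fields[i][0:string_length])
        if i != 3 + number_of_fields*3:
            string_length = int(fields[i][string_length:])
        i += 1

    return pd.DataFrame({'field_name': field_names_list,
                         'new_value': new_values_list,
                         'old_value': old_values_list})

# df is assumed to be the two-column dataframe shown above (row_id, string)
pieces = []
for row_id, raw in zip(df['row_id'], df['string']):
    parsed = parse_string(raw)
    parsed.insert(0, 'row_id', row_id)  # carry the row_id alongside the parsed fields
    pieces.append(parsed)

result = pd.concat(pieces, ignore_index=True)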
I have a simple data frame with IDs and date, like below:
ID Date
a 2009/12/1
c 2009/12/1
d 2009/12/1
a 2010/4/1
c 2010/5/1
e 2010/5/1
b 2010/12/1
b 2012/3/1
e 2012/7/1
b 2013/1/1
...
...
I need to count unique values for each month and accumulate them, without counting IDs that have already appeared. For instance:
2009/12/1 3
2010/4/1 3
2010/5/1 4
... ...
I created a loop, but it is not working:
for d in df['date'].drop_duplicates():
    c = df[df['date'] <= d].ID.nunique()
    df2 = DataFrame(data=c, index=d)
Can anyone tell me where the problem is? Thanks.
You should be using groupby() rather than looping over your data frame. After grouping by the date column, you can count the unique instances of ID using:
df.groupby('Date')['ID'].nunique()
Quick example:
df = pd.DataFrame([['a', '2009/12/1'],
                   ['c', '2009/12/1'],
                   ['d', '2009/12/1'],
                   ['c', '2009/12/1'],
                   ['a', '2010/4/1'],
                   ['c', '2010/5/1'],
                   ['e', '2010/5/1']], columns=['ID', 'Date'])
df.groupby('Date')['ID'].nunique()
# returns:
# Date
# 2009/12/1 3
# 2010/4/1 1
# 2010/5/1 2
One option is to write a for loop and use a set to hold the cumulative unique IDs:
cumcount = []
cumunique = set()
date = []
for k, g in df.groupby(pd.to_datetime(df.Date)):
    cumunique |= set(g.ID)           # hold cumulative unique IDs
    date.append(g.Date.iat[0])       # get the date variable for each group
    cumcount.append(len(cumunique))  # hold cumulative count of unique IDs
pd.DataFrame({"Date": date, "ID": cumcount})
I have a pandas dataframe where all missing values are np.nan, and I am trying to replace these missing values. The last column of my data is "class". I need to group the data by class, then for each column of the group compute the mean/median/mode (depending on whether the data is categorical or continuous, normal or not) and replace the missing values of that group of the column with the respective mean/median/mode.
This is the code I have come up with, which I know is overkill.
If I could:
group the columns of the dataframe
get the median/mode/mean of each column within the groups
replace the missing values of those groups
recombine them back into the original df
it would be great.
But currently I have ended up finding the replacement values (mean/median/mode) group-wise and storing them in a dict, then separating the rows with NaN from the rows without NaN, replacing the missing values in the NaN rows, and trying to join them back into the dataframe (which I don't yet know how to do).
def fillMissing(df, dataType):
    '''
    Args:
        df (2d array / dict):
            eg: ('attribute1': [12, 24, 25], 'attribute2': ['good', 'bad'])
        dataType (dict): dictionary with attribute names of df as keys and values 0/1
            indicating categorical/continuous variables, eg: ('attribute1': 1, 'attribute2': 0)
    Returns:
        dataframe with missing values filled;
        writes a file with missing values replaced.
    '''
    dataLabels = list(df.columns.values)
    # the dictionary to hold the values to put in place of nan
    replaceValues = {}
    for eachlabel in dataLabels:
        thisSer = df[eachlabel]
        if dataType[eachlabel] == 1:  # if it is a continuous variable
            _, pval = stats.normaltest(thisSer)
            groupedd = thisSer.groupby(df['class'])
            innerDict = {}
            for name, group in groupedd:
                if pval < 0.5:
                    groupMiddle = group.median()  # get the median of the group
                else:
                    groupMiddle = group.mean()  # get the mean (if the group is normal)
                innerDict[name.strip()] = groupMiddle
            replaceValues[eachlabel] = innerDict
        else:  # if the variable is categorical
            # freqCount = collections.Counter(thisSer)
            groupedd = thisSer.groupby(df['class'])
            innerDict = {}
            for name, group in groupedd:
                freqC = collections.Counter(group)
                mostFreq = freqC.most_common(1)  # most frequent value of the attribute (grouped by class)
                # newGroup = group.replace(np.nan, mostFreq)
                innerDict[name.strip()] = mostFreq[0][0].strip()
            replaceValues[eachlabel] = innerDict
    print(replaceValues)

    # replace the missing values =======================
    newfile = open('missingReplaced.csv', 'w')
    newdf = df
    mask = False
    for col in df.columns:
        mask = mask | df[col].isnull()
    # get the dataframe of rows that contain nulls
    dfnulls = df[mask]
    dfnotNulls = df[~mask]
    for _, row in dfnulls.iterrows():
        for colname in dataLabels:
            if pd.isnull(row[colname]):
                if row['class'].strip() == '>50K':
                    row[colname] = replaceValues[colname]['>50K']
                else:
                    row[colname] = replaceValues[colname]['<=50K']
            newfile.write(str(row[colname]) + ",")
        newdf.append(row)
        newfile.write("\n")
    # here add newdf to dfnotNulls to get finaldf
    return finaldf
If I understand correctly, this is mostly in the documentation, but probably not where you'd be looking if you're asking the question. See note regarding mode at the bottom as it is slightly trickier than mean and median.
df = pd.DataFrame({ 'v':[1,2,2,np.nan,3,4,4,np.nan] }, index=[1,1,1,1,2,2,2,2],)
df['v_mean'] = df.groupby(level=0)['v'].transform( lambda x: x.fillna(x.mean()))
df['v_med' ] = df.groupby(level=0)['v'].transform( lambda x: x.fillna(x.median()))
df['v_mode'] = df.groupby(level=0)['v'].transform( lambda x: x.fillna(x.mode()[0]))
df
v v_mean v_med v_mode
1 1 1.000000 1 1
1 2 2.000000 2 2
1 2 2.000000 2 2
1 NaN 1.666667 2 2
2 3 3.000000 3 3
2 4 4.000000 4 4
2 4 4.000000 4 4
2 NaN 3.666667 4 4
Note that mode() may not be unique, unlike mean and median and pandas returns it as a Series for that reason. To deal with that, I just took the simplest route and added [0] in order to extract the first member of the series.
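A sketch of how the same transform/fillna idea could be adapted to the asker's setup (assuming a 'class' column and a dataType dict mapping each column name to 1 for continuous and 0 for categorical, as in the question; fill_by_class and the sample data are made up here):
import numpy as np
import pandas as pd

def fill_by_class(df, dataType):
    out = df.copy()
    for col in out.columns.drop('class'):
        grouped = out.groupby('class')[col]
        if dataType[col] == 1:
            # continuous: fill with the group mean (swap in median where appropriate)
            out[col] = grouped.transform(lambda x: x.fillna(x.mean()))
        else:
            # categorical: fill with the group mode, taking the first value if there are ties
            out[col] = grouped.transform(lambda x: x.fillna(x.mode()[0]))
    return out

# Example usage with made-up data
sample = pd.DataFrame({'age': [25, np.nan, 40, 38],
                       'job': ['a', 'b', np.nan, 'b'],
                       'class': ['<=50K', '<=50K', '>50K', '>50K']})
filled = fill_by_class(sample, {'age': 1, 'job': 0})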