I'm trying to find a way to compare the equality of values contained in two dataframes that have different column names.
label = {
'aoo' : ['a', 'b', 'c'],
'boo' : ['a', 'b', 'c'],
'coo' : ['a', 'b', 'c'],
'label': ['label', 'label', 'label']
}
unlabel = {
'unlabel1' : ['a', 'b', 'c'],
'unlabel2' : ['a', 'b', 'c'],
'unlabel3': ['a', 'b', 'hhh']
}
label = pd.DataFrame(label)
unlabel = pd.DataFrame(unlabel)
The desired output is a dataframe that contains the columns whose values are all equal, plus the label column. Because unlabel['unlabel3'] has a value that is not equal, I don't want to keep that column in the output.
desired_output = {
'unlabel1' : ['a', 'b', 'c'],
'unlabel2' : ['a', 'b', 'c'],
'label' : ['label', 'label', 'label']
}
If the labels were numbers I could try np.where, but I can't find a similar helper for strings.
Could you help?
Thanks
You can use pd.merge and specify the columns to merge on with left_on and right_on:
out = unlabel.merge(label, left_on=['unlabel1', 'unlabel2', 'unlabel3'], right_on=['aoo', 'boo', 'coo'], how='left').drop(['unlabel3', 'aoo', 'boo', 'coo'], axis=1)
print(out)
unlabel1 unlabel2 label
0 a a label
1 b b label
2 c c NaN
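Note that the left merge keeps the mismatching row with a NaN label rather than dropping the unlabel3 column. If the goal is exactly the desired_output above (keep only the columns that match completely, plus label), a column-wise comparison is a possible alternative; this is a sketch assuming the columns line up positionally:

```python
import pandas as pd

label = pd.DataFrame({
    'aoo': ['a', 'b', 'c'],
    'boo': ['a', 'b', 'c'],
    'coo': ['a', 'b', 'c'],
    'label': ['label', 'label', 'label'],
})
unlabel = pd.DataFrame({
    'unlabel1': ['a', 'b', 'c'],
    'unlabel2': ['a', 'b', 'c'],
    'unlabel3': ['a', 'b', 'hhh'],
})

# Pair each unlabel column with the label column at the same position
# and keep only the columns whose values are all equal.
keep = [u for u, l in zip(unlabel.columns, ['aoo', 'boo', 'coo'])
        if (unlabel[u] == label[l]).all()]

out = unlabel[keep].assign(label=label['label'])
print(out)
```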
Related
I have a df after read_excel where some values (from one column, with strings) are split across several rows. How can I merge them back?
for example:
the df i have
{'CODE': ['A', None, 'B', None, None, 'C'],
'TEXT': ['A', 'a', 'B', 'b', 'b', 'C'],
'NUMBER': ['1', None, '2', None, None,'3']}
the df i want
{'CODE': ['A','B','C'],
'TEXT': ['Aa','Bbb','C'],
'NUMBER': ['1','2','3']}
I can't find the right solution. I tried importing the data in different ways, but that did not help either.
You can forward fill the missing values (Nones) in CODE to form groups, then aggregate TEXT with join and take the first non-null value for the NUMBER column:
d = {'CODE': ['A', None, 'B', None, None, 'C'],
'TEXT': ['A', 'a', 'B', 'b', 'b', 'C'],
'NUMBER': ['1', None, '2', None, None,'3']}
df = pd.DataFrame(d)
df1 = df.groupby(df['CODE'].ffill()).agg({'TEXT':''.join, 'NUMBER':'first'}).reset_index()
print (df1)
CODE TEXT NUMBER
0 A Aa 1
1 B Bbb 2
2 C C 3
You can generalize by building the aggregation dictionary:
cols = df.columns.difference(['CODE'])
d1 = dict.fromkeys(cols, 'first')
d1['TEXT'] = ''.join
df1 = df.groupby(df['CODE'].ffill()).agg(d1).reset_index()
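Putting the generalized version together as a self-contained sketch (same sample data as above):

```python
import pandas as pd

df = pd.DataFrame({'CODE': ['A', None, 'B', None, None, 'C'],
                   'TEXT': ['A', 'a', 'B', 'b', 'b', 'C'],
                   'NUMBER': ['1', None, '2', None, None, '3']})

# Build the aggregation spec: 'first' for every column except CODE,
# then override TEXT to concatenate its pieces.
cols = df.columns.difference(['CODE'])
d1 = dict.fromkeys(cols, 'first')
d1['TEXT'] = ''.join

# Forward fill CODE so each fragment row joins the group above it.
df1 = df.groupby(df['CODE'].ffill()).agg(d1).reset_index()
print(df1)
```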
I have a Pandas dataframe similar to:
df = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['Col'])
df
Col
0 a
1 b
2 c
3 d
I am trying to convert all rows of this column to a comma-separated string with each value in single quotes, like below:
'a', 'b', 'c', 'd'
I have tried the following with several different combinations, but this is the closest I got:
s = df['Col'].str.cat(sep="', '")
s
"a', 'b', 'c', 'd"
I think that the end result should be:
"'a', 'b', 'c', 'd'"
A quick fix would be
"'" + df['Col'].str.cat(sep="', '") + "'"
"'a', 'b', 'c', 'd'"
Another alternative is wrapping each element in an extra quote and then using the default str.join:
', '.join([f"'{i}'" for i in df['Col']])
"'a', 'b', 'c', 'd'"
Try this to get the values as a plain Python list (note this gives a list, not the single quoted string asked for):
s = df['Col'].tolist()
Try something like this:
df = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['Col1'])
values = df['Col1'].to_list()
with_quotes = ["'"+x+"'" for x in values]
','.join(with_quotes)
Output:
"'a','b','c','d'"
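Since the values here are plain strings, repr() already wraps each one in single quotes, so (as a sketch using the question's df) the whole thing can also be done in one pass:

```python
import pandas as pd

df = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['Col'])

# repr() of a plain string includes the surrounding single quotes,
# so joining the reprs yields the quoted, comma-separated result.
s = ', '.join(map(repr, df['Col']))
print(s)  # 'a', 'b', 'c', 'd'
```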
I have a df with columns a-h, and I wish to create a list of these column names in the order given by another list (list1), where list1 contains positional indices into df's columns.
df
a b c d e f g h
list1
[3,1,0,5,2,7,4,6]
Desired list
['d', 'b', 'a', 'f', 'c', 'h', 'e', 'g']
You can just do df.columns[list1]:
import pandas as pd
df = pd.DataFrame([], columns=list('abcdefgh'))
list1 = [3,1,0,5,2,7,4,6]
print(df.columns[list1])
# Index(['d', 'b', 'a', 'f', 'c', 'h', 'e', 'g'], dtype='object')
First, get a np.array of the alphabet:
import numpy as np

arr = np.array(list('abcdefgh'))
Or, in your case, an array of your df columns:
arr = np.array(df.columns)
Then use your indices as an indexing mask:
arr[[3, 1, 0]]
out:
array(['d', 'b', 'a'], dtype='<U1')
Check
df.columns.to_series()[list1].tolist()
Is there any pandas method to unfactor a dataframe column? I could not find any in the documentation, but was expecting something similar to unfactor in R language.
I managed to come up with the following code for reconstructing the column (assuming none of the column values are missing), using the labels array values as indices into uniques.
orig_col = ['b', 'b', 'a', 'c', 'b']
labels, uniques = pd.factorize(orig_col)
recon_col = np.array([uniques[label] for label in labels]).tolist()
orig_col == recon_col
orig_col = ['b', 'b', 'a', 'c', 'b']
labels, uniques = pd.factorize(orig_col)
# To get original list back
uniques[labels]
# array(['b', 'b', 'a', 'c', 'b'], dtype=object)
Yes, we can do it via np.vectorize with a dict mapping codes to uniques:
np.vectorize(dict(zip(range(len(uniques)),uniques)).get)(labels)
array(['b', 'b', 'a', 'c', 'b'], dtype='<U1')
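Another option: pandas can invert factorize directly with pd.Categorical.from_codes, which rebuilds the values from the codes/uniques pair:

```python
import pandas as pd

orig_col = ['b', 'b', 'a', 'c', 'b']
codes, uniques = pd.factorize(orig_col)

# from_codes maps each code back to its category, which is
# effectively the inverse of factorize.
recon = list(pd.Categorical.from_codes(codes, uniques))
print(recon)  # ['b', 'b', 'a', 'c', 'b']
```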
I have a dataframe containing strings and NaNs. I want to str.lower() certain columns by name: to_lower = ['b', 'd', 'e']. Ideally I could do it with a method on the whole dataframe, rather than with a method on df[to_lower]. I have
df[to_lower] = df[to_lower].apply(lambda x: x.astype(str).str.lower())
but I would like a way to do it without assigning to the selected columns.
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b']})
to_lower = ['a']
df2 = df.copy()
df2[to_lower] = df2[to_lower].apply(lambda x: x.astype(str).str.lower())
You can use the assign method and unpack the result as keyword arguments:
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b'], 'c': ['C', 'c']})
to_lower = ['a', 'b']
df.assign(**df[to_lower].apply(lambda x: x.astype(str).str.lower()))
# a b c
#0 a b C
#1 a b c
You want this:
for column in to_lower:
df[column] = df[column].str.lower()
This is far more efficient assuming you have more rows than columns.
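One detail worth noting: .str.lower() propagates NaN untouched, so the loop version doesn't need the astype(str) cast from the question (which would turn NaN into the literal string 'nan'). A small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['A', np.nan], 'b': ['B', 'b'], 'c': ['C', 'c']})
to_lower = ['a', 'b']

# .str.lower() skips missing values, leaving NaN as NaN.
for column in to_lower:
    df[column] = df[column].str.lower()

print(df)
```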