I am trying to get the column names which have cell values less than .2, without repeating a combination of columns. I tried this to iterate over the column names without success:
pvals2 = pd.DataFrame({'col1': [1, .2, .7],
                       'col2': [.2, 1, .01],
                       'col3': [.7, .01, 1]},
                      index=['col1', 'col2', 'col3'])
print(pvals2)
print('---')
pvals2.transpose().join(pvals2, how='outer')
My goal is:
col3 col2 .01
#col2 col3 .01 #NOT INCLUDED (because it is a repeat)
A list comprehension is one way:
pvals2 = pd.DataFrame({'col1': [1, .2, .7], 'col2': [.2, 1, .01], 'col3': [.7, .01, 1]},
                      index=['col1', 'col2', 'col3'])
res = [col for col in pvals2 if (pvals2[col] < 0.2).any()]
# ['col2', 'col3']
Getting the values as well, as in your desired output, requires more specification, since a column may have more than one value less than 0.2.
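One rough sketch (not part of the original answer) is to stack the frame and keep each unordered pair of columns only once by comparing the two index levels:
s = pvals2.stack()
# keep values below 0.2 and only one of each symmetric (row, column) pair
mask = (s < 0.2) & (s.index.get_level_values(0) < s.index.get_level_values(1))
print(s[mask])
# col2  col3    0.01
# dtype: float64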
Iterate through the columns and check if any value meets your conditions:
pvals2 = pd.DataFrame({'col1': [1, .2, .7],
                       'col2': [.2, 1, .01],
                       'col3': [.7, .01, 1]})

cols_with_small_values = set()
for col in pvals2.columns:
    if any(i < 0.2 for i in pvals2[col]):
        cols_with_small_values.add(col)
        cols_with_small_values.add(pvals2[col].min())

print(cols_with_small_values)
RESULT: {'col3', 0.01, 'col2'}
any is a built-in. This question has a good explanation of how any works. We use a set to ensure each column appears only once, and Series.min() to get the small value that caused us to select that column.
You could use stack, keep only the values less than 0.2, and then drop the duplicated values, keeping the last:
pvals2.stack()[pvals2.stack().lt(.2)].drop_duplicates(keep='last')
col3 col2 0.01
dtype: float64
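Note that drop_duplicates compares values, so two different column pairs that happened to share the same value would also be collapsed. A rough alternative sketch (assuming numpy is available) masks the lower triangle first, so symmetric pairs are removed by position rather than by value:
import numpy as np

# keep only the strict upper triangle, so each pair of columns appears once
upper = pvals2.where(np.triu(np.ones(pvals2.shape, dtype=bool), k=1))
s = upper.stack()
print(s[s < 0.2])
# col2  col3    0.01
# dtype: float64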
pvals2 = pd.DataFrame({'col1': [1, .2, .7],
                       'col2': [.2, 1, .01],
                       'col3': [.7, .01, 1]},
                      index=['col1', 'col2', 'col3'])

pvals2.min().where(lambda x: x < 0.2).dropna()
Output
col2 0.01
col3 0.01
dtype: float64
I have been looking for an answer without success (1,2,3), and a lot of the questions I have found about string aggregation cover only the case where all the columns are strings. This is a mixed aggregation with some specific details.
The df:
df = pd.DataFrame({
    'Group': ['Group_1', 'Group_1', 'Group_1', 'Group_1', 'Group_2', 'Group_2', 'Group_2', 'Group_2', 'Group_2', 'Group_2'],
    'Col1': ['A', 'A', 'B', np.nan, 'B', 'B', 'C', 'C', 'C', 'C'],
    'Col2': [1, 2, 3, 3, 5, 5, 5, 7, np.nan, 7],
    'Col3': [np.nan, np.nan, np.nan, np.nan, 3, 2, 3, 4, 5, 5],
    'Col4_to_Col99': ['some value', 'some value', 'some value', 'some value', 'some value', 'some value', 'some value', 'some value', 'some value', 'some value']
})
The function used for the output (source):
def join_non_nan_values(elements):
    return ";".join([elem for elem in elements if elem == elem])  # elem == elem is False for NaN values, so they are skipped
The output:
df.groupby('Group')[['Col1', 'Col2', 'Col3']].agg({'Col1': join_non_nan_values, 'Col2': 'count', 'Col3':'mean'})
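(The original post showed this output as an image; with the sample df above it should look roughly like:)
                Col1  Col2      Col3
Group
Group_1        A;A;B     4       NaN
Group_2  B;B;C;C;C;C     5  3.666667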
The expected output:
For Col1 and Col2 the expected output is a count of each value, with the value on the left and its count on the right.
P.S.: If you know a more efficient way to implement the join_non_nan_values function, you are welcome to share it! (It actually takes a while to run.) Just remember that it needs to skip missing values.
You can try this:
def f(x):
    c = x.value_counts().sort_index()
    return ";".join(f"{k}:{v}" for (k, v) in c.items())

# cast to the nullable integer dtype so the keys print as 1:1 rather than 1.0:1
df["Col2"] = df["Col2"].astype('Int64')

df.groupby("Group")[["Col1", "Col2", "Col3"]].agg({
    "Col1": f,
    "Col2": f,
    "Col3": 'mean'
})
It gives:
            Col1         Col2      Col3
Group
Group_1  A:2;B:1  1:1;2:1;3:2       NaN
Group_2  B:2;C:4      5:3;7:2  3.666667
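Regarding the P.S. about a faster join_non_nan_values: a rough sketch, assuming elements arrives as a pandas Series (which it does when the function is used via .agg), is to let pandas drop the missing values before joining:
def join_non_nan_values(elements):
    # dropna() removes the missing values up front; astype(str) guards against non-string data
    return ";".join(elements.dropna().astype(str))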
You can try calling value_counts() inside groupby().apply() and converting the outcome into strings using the str.join() method. To have a DataFrame (not a Series) returned as the output, use the as_index=False parameter in groupby().
def func(g):
    """
    (i) Count the values in the Col1 and Col2 columns by calling value_counts() on each column
        and convert the output into strings via the join() method
    (ii) Calculate the mean of Col3
    """
    col1 = ';'.join([f'{k}:{v}' for k, v in g['Col1'].value_counts(sort=False).items()])
    col2 = ';'.join([f'{int(k)}:{v}' for k, v in g['Col2'].value_counts(sort=False).items()])
    col3 = g['Col3'].mean()
    return col1, col2, col3

# group by Group and apply func to specific columns
result = df.groupby('Group', as_index=False)[['Col1', 'Col2', 'Col3']].apply(func)
result
result
I am searching for a way to map some numeric columns to categorical features.
All columns are of categorical nature but are represented as integers. However I need them to be a "String".
e.g.
col1 col2 col3 -> col1new col2new col3new
0 1 1 -> "0" "1" "1"
2 2 3 -> "2" "2" "3"
1 3 2 -> "1" "3" "2"
It does not matter what kind of String the new column contains as long as all distinct values from the original data set map to the same new String value.
Any ideas?
I have a numpy representation of my data right now, but any pandas solution would also be helpful.
Thanks a lot!
You can use the applymap method. Consider the following example:
df = pd.DataFrame({'col1': [0, 2, 1], 'col2': [1, 2, 3], 'col3': [1, 3, 2]})
df.applymap(str)
col1 col2 col3
0 0 1 1
1 2 2 3
2 1 3 2
You can convert all elements of col1, col2, and col3 to str using the following command:
df = df.applymap(str)
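Note that in newer pandas versions (2.1 and later) applymap is deprecated in favour of DataFrame.map, which performs the same elementwise conversion:
df = df.map(str)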
You can modify the type of the elements by using the DataFrame.apply function offered by pandas.
frame = pd.DataFrame(np.random.randint(0, 90, size =(5000000, 3)), columns =['col1', 'col2', 'col3'])
In the new dataframe you can define the columns and set the values with:
updated_frame = pd.DataFrame(np.random.randint(0, 90, size =(5000000, 3)), columns =['col1new', 'col2new', 'col3new'])
updated_frame['col1new'] = frame['col1'].apply(str)
updated_frame['col2new'] = frame['col2'].apply(str)
updated_frame['col3new'] = frame['col3'].apply(str)
You could use the .astype method. If you want to replace all the current columns with a string version, then you could do (where df is your dataframe):
df = df.astype(str)
If you want to add the string columns as new ones:
df = df.assign(**{f"{col}new": df[col].astype(str) for col in df.columns})
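For example, with a frame holding col1, col2 and col3 (like the one earlier in the thread), this keeps the original columns and adds the string copies alongside:
print(df.columns.tolist())
# ['col1', 'col2', 'col3', 'col1new', 'col2new', 'col3new']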
Let's say I have this dataframe:
df
col1  col2    col3   col4
1     apple   NaN    apple
2     NaN     False  1.3
NaN   orange  True   NaN
I'd like to get a list of all the types in each column, excluding the NaN/null cells. The output could be a dictionary like this:
{'col1': int, 'col2': str, 'col3':bool, 'col4': [str,float]}
I've gotten as far as creating a dictionary that outputs all the types in each column, including the NaN values. I'm not sure how to exclude the NaNs.
output = {}
for col in df.columns.values.tolist():
    list_types = [x.__name__ for x in df[col].apply(type).unique()]
    output[col] = list_types
The code above would get me almost what I want, but with a bunch of extra "float"s for the NaNs:
{'col1': [int,float], 'col2': [str,float], 'col3':[bool,float], 'col4': [str,float]}
To exclude NaN, do
df = df.dropna()
Then to get the data types:
df.dtypes
In the approach below, I extract the non-NaN items into a list and then take the type of the first remaining element:
# initialize the empty output dictionary
output = {}
# loop over the columns
for column in df:
    a = [x for x in df[column] if str(x) != 'nan']
    output[column] = type(a[0])
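Note that type(a[0]) only reports the type of the first non-NaN element, so a mixed column such as col4 shows a single type. A rough variant, for illustration, that collects every non-NaN type per column:
output = {column: sorted({type(x).__name__ for x in df[column] if x == x})
          for column in df}
# e.g. {'col1': ['float'], 'col2': ['str'], 'col3': ['bool'], 'col4': ['float', 'str']}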
Try with stack, which will drop the NaN values, then we do groupby + unique:
df.stack().apply(lambda x : type(x).__name__).groupby(level=1).unique().to_dict()
{'col1': array(['float'], dtype=object), 'col2': array(['str'], dtype=object), 'col3': array(['bool'], dtype=object), 'col4': array(['str'], dtype=object)}
I wanted to calculate the percentage of some object in one hour ('Time'), so I tried to write a lambda function. I think it does the job, but the index columns, the columns the dataframe is grouped by, disappeared.
df = df.groupby(['id', 'name', 'time', 'object', 'type'], as_index=True, sort=False)['col1', 'col2', 'col3', 'col4', 'col5'].apply(lambda x: x * 100 / 3600).reset_index()
After that code I print df.columns and got this:
Index([u'index', u'col1', u'col2', u'col3',
       u'col4', u'col5'],
      dtype='object')
If there is a need I am going to write some table with values for each column.
Thanks in advance.
Moving the loop over the columns outward will make the code run significantly faster:
for c in ['col1', 'col2', 'col3', 'col4', 'col5']:
    df[c] *= 100. / 3600
This is because each column's calculation is done in a vectorized way.
This also won't modify the index in any way.
pd.DataFrame.groupby is used to aggregate data, not to apply a function to multiple columns.
For simple functions, you should look for a vectorised solution. For example:
# set up simple dataframe
df = pd.DataFrame({'id': [1, 2, 1], 'name': ['A', 'B', 'A'],
                   'col1': [5, 6, 8], 'col2': [9, 4, 5]})
# apply logic in a vectorised way on multiple columns
df[['col1', 'col2']] = df[['col1', 'col2']].values * 100 / 3600
If you wish to set your index as multiple columns, and are keen to use pd.DataFrame.apply, this is possible as two separate steps. For example:
df = df.set_index(['id', 'name'])
df[['col1', 'col2']] = df[['col1', 'col2']].apply(lambda x: x * 100 / 3600)
You apply .reset_index(), which resets the index. Take a look at the pandas documentation and you'll see that .reset_index() transfers the index to the columns.
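A quick illustration of that behaviour, using a small throwaway frame:
tmp = pd.DataFrame({'id': [1, 2], 'col1': [5, 6]}).set_index('id')
print(tmp.columns)                 # Index(['col1'], dtype='object')
print(tmp.reset_index().columns)   # Index(['id', 'col1'], dtype='object')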
Data from Jpp
df[['col1','col2']]*=100/3600
df
Out[110]:
col1 col2 id name
0 0.138889 0.250000 1 A
1 0.166667 0.111111 2 B
2 0.222222 0.138889 1 A
I have 3 dataframes that I'd like to combine. They look like this:
df1          | df2          | df3
col1  col2   | col1  col2   | col1  col3
1     5      | 2     9      | 1     some
             |              | 2     data
I'd like the first two df-s to be merged into the third df based on col1, so the desired output is
df3
col1 col3 col2
1 some 5
2 data 9
How can I achieve this? I'm trying:
df3['col2'] = df1[df1.col1 == df3.col1].col2 if df1[df1.col1 == df3.col1].col2 is not None else df2[df2.col1 == df3.col1].col2
For this I get ValueError: Series lengths must match to compare
It is guaranteed that df3's col1 values are present in either df1 or df2. What's the way to do this? PLEASE NOTE that a simple concat will not work, since there is other data in df3, not just col1.
If df1 and df2 don't have duplicates in col1, you can try this:
pd.concat([df1, df2]).merge(df3)
Data:
df1 = pd.DataFrame({'col1': [1], 'col2': [5]})
df2 = pd.DataFrame({'col1': [2], 'col2': [9]})
df3 = pd.DataFrame({'col1': [1,2], 'col3': ['some', 'data']})
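With that data, the concat/merge above should give something like:
pd.concat([df1, df2]).merge(df3)
#    col1  col2  col3
# 0     1     5  some
# 1     2     9  data
If the exact column order matters, the result can be reordered afterwards, e.g. result[['col1', 'col3', 'col2']].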