I have a dataframe with multiple columns of tuple data. I'm trying to normalize the data within the tuple for each row, per column. Here is an example with lists, but it should be the same concept for tuples as well:
df = pd.DataFrame(np.random.randn(5, 10), columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
df['arr1'] = df[['a', 'b', 'c', 'd', 'e']].values.tolist()
df['arr2'] = df[['f', 'g', 'h', 'i', 'j']].values.tolist()
If I wish to normalize each list row for a few columns, I would do this:
df['arr1'] = [preprocessing.scale(row) for row in df['arr1']]
df['arr2'] = [preprocessing.scale(row) for row in df['arr2']]
However, since I have about 100 such columns in my original dataset, I obviously don't want to manually normalize per column. How can I loop across all columns?
You can loop through the columns in a DataFrame like this to process each column:
for col in df.columns:
    df[col] = [preprocessing.scale(row) for row in df[col]]
Of course, this only works if you want to process all of the columns in the DataFrame. If you only want a subset, you could create a list of columns first, or you could drop the other columns.
# Here's an example where you manually specify the columns
cols_to_process = ["arr1", "arr2"]
for col in cols_to_process:
    df[col] = [preprocessing.scale(row) for row in df[col]]
# Here's an example where you drop the unwanted columns first
cols_to_drop = ["a", "b", "c"]
df = df.drop(columns=cols_to_drop)
for col in cols_to_process:
    df[col] = [preprocessing.scale(row) for row in df[col]]
# Or, if you didn't want to actually drop the columns
# from the original DataFrame you could do it like this:
cols_to_drop = ["a", "b", "c"]
for col in df.drop(columns=cols_to_drop):
    df[col] = [preprocessing.scale(row) for row in df[col]]
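If the list/tuple columns follow a naming pattern, you could also build cols_to_process programmatically instead of typing about 100 names out. A minimal sketch, assuming the columns share an "arr" prefix (swap in whatever condition matches your real column names):
from sklearn import preprocessing

# Hypothetical: pick every column whose name starts with "arr".
# Adjust the condition to match your actual list/tuple columns.
cols_to_process = [col for col in df.columns if col.startswith("arr")]

for col in cols_to_process:
    df[col] = [preprocessing.scale(row) for row in df[col]]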
I have a dictionary mapping strings to lists of strings, for example:
{'A':['A'], 'B':['A', 'B', 'C'], 'C':['B', 'E', 'F']}
I am looking to use this to filter a dataframe, creating new dfs with the name of the df being the key and the columns to be copied containing the string listed as the values.
So dataframe A would contain columns from the original dataframe that contain 'A', dataframe B would contain columns that contain 'A', 'B', 'C'. I know that I need to use regex filtering for selecting the columns but am unsure how to do this.
Use DataFrame.filter with regex, joining the values with | (regex "or"). For key C, this means columns matching B or E or F are selected:
d = {'A':['A'], 'B':['A', 'B', 'C'], 'C':['B', 'E', 'F']}
dfs = {k:df.filter(regex='|'.join(v)) for k, v in d.items()}
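For example, with a small DataFrame whose column names are made up here just for illustration, dfs['C'] keeps every column whose name matches B, E, or F:
import pandas as pd

# Hypothetical column names; filter(regex=...) keeps any column whose name
# matches one of the letters joined by "|".
df = pd.DataFrame(columns=['A1', 'B1', 'C1', 'E2', 'F2', 'X9'])

d = {'A':['A'], 'B':['A', 'B', 'C'], 'C':['B', 'E', 'F']}
dfs = {k: df.filter(regex='|'.join(v)) for k, v in d.items()}

print(dfs['C'].columns.tolist())  # ['B1', 'E2', 'F2']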
I have a pandas Dataframe where one of the columns is full of lists:
import pandas
df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])
And I'd like to make a pivot table that shows the list and a count of occurrences
List Count
[a,b,c] 2
[d,e,f] 1
Because list is a non-hashable type, what aggregate functions could do this?
You can zip a list of rows and a list of counts, then make a dataframe from the zip object:
import pandas
df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])
rows = []
counts = []
for index, row in df.iterrows():
    if row[1] not in rows:
        rows.append(row[1])
        counts.append(1)
    else:
        counts[rows.index(row[1])] += 1
df = pandas.DataFrame(zip(rows, counts))
print(df)
The solution I ended up using was:
import pandas
df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])
print(df[1])
df[1] = df[1].map(tuple)
#Thanks Ch3steR
df2 = pandas.pivot_table(df,index=df[1], aggfunc='count')
print(df2)
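If you only need the tallies rather than a full pivot table, a shorter route to the same counts (a sketch of the same map-to-tuple idea) is value_counts:
import pandas

df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])

# Tuples are hashable, so value_counts can tally them directly.
print(df[1].map(tuple).value_counts())
# prints each distinct tuple with its number of occurrences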
I have a df with columns a-h, and I wish to create a list of these column values, but in the order of values in another list (list1). list1 corresponds to the index value in df.
df
a b c d e f g h
list1
[3,1,0,5,2,7,4,6]
Desired list
['d', 'b', 'a', 'f', 'c', 'h', 'e', 'g']
You can just do df.columns[list1]:
import pandas as pd
df = pd.DataFrame([], columns=list('abcdefgh'))
list1 = [3,1,0,5,2,7,4,6]
print(df.columns[list1])
# Index(['d', 'b', 'a', 'f', 'c', 'h', 'e', 'g'], dtype='object')
First, get a np.array of the letters
import numpy as np
arr = np.array(list('abcdefgh'))
Or, in your case, an array of your df columns
arr = np.array(df.columns)
Then use your indices as an indexing mask
arr[[3,1,0]]
out:
['d', 'b', 'a']
Check
df.columns.to_series()[list1].tolist()
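Applied to the example above, that gives the same ordering as a plain Python list. A quick sketch (note that indexing a Series with a list of integer positions relies on positional fallback; on newer pandas you may prefer .iloc[list1]):
import pandas as pd

df = pd.DataFrame([], columns=list('abcdefgh'))
list1 = [3, 1, 0, 5, 2, 7, 4, 6]

print(df.columns.to_series()[list1].tolist())
# ['d', 'b', 'a', 'f', 'c', 'h', 'e', 'g']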
I have a csv file separated by tabs:
I need only to focus in the two first columns and find, for example, if the pair A-B appears in the document again as B-A and print A-B if the B-A appears. The same for the rest of pairs.
For the example proposed the output is:
A-B
C-D
import sys
import os
import pandas as pd
import numpy as np
import csv

dic = {}
colnames = ['col1', 'col2', 'col3', 'col4', 'col5']
data = pd.read_csv('koko.csv', names=colnames, delimiter='\t')
col1 = data.col1.tolist()
col2 = data.col2.tolist()
dataset = list(zip(col1, col2))
for a, b in dataset:
    if (a, b) and (b, a) in dataset:
        dic[a] = b
print(dic)
output = {'A': 'B', 'B': 'A', 'D': 'C', 'C':'D'}
How can I avoid duplicated (or swapped) results in the dictionary?
Does this work?:
import pandas as pd
import numpy as np
col_1 = ['A', 'B', 'C', 'B', 'D']
col_2 = ['B', 'C', 'D', 'A', 'C']
df = pd.DataFrame(np.column_stack([col_1,col_2]), columns = ['Col1', 'Col2'])
df['combined'] = list(zip(df['Col1'], df['Col2']))
final_set = set(tuple(sorted(t)) for t in df['combined'])
final_set looks like this:
{('C', 'D'), ('A', 'B'), ('B', 'C')}
The output contains more than A-B and C-D because of the second row that has B-C
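If you only want the pairs that actually appear in both orders (A-B and B-A), rather than every unique pair, you could add a membership test against the set of original tuples. A sketch building on the same df:
pairs = set(zip(df['Col1'], df['Col2']))

# Keep a sorted pair only when its reverse also occurs somewhere in the data.
both_ways = {tuple(sorted(t)) for t in pairs if (t[1], t[0]) in pairs}
print(both_ways)  # {('A', 'B'), ('C', 'D')}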
The below should work.
example df used:
df = pd.DataFrame({'Col1' : ['A','C','D','B','D','A'], 'Col2' : ['B','D','C','A','C','B']})
This is the function I used:
temp = df[['Col1','Col2']].apply(lambda row: sorted(row), axis=1, result_type='broadcast')
print(temp[['Col1','Col2']].drop_duplicates())
useful links:
checking if a string is in alphabetical order in python
Difference between map, applymap and apply methods in Pandas
Here is one way.
df = pd.DataFrame({'Col1' : ['A','C','D','B','D','A','E'],
                   'Col2' : ['B','D','C','A','C','B','F']})

df = df.drop_duplicates()\
       .apply(sorted, axis=1, result_type='broadcast')
df = df.loc[df.duplicated(subset=['Col1', 'Col2'], keep=False)]\
       .drop_duplicates()
# Col1 Col2
# 0 A B
# 1 C D
Explanation
The steps are:
Remove duplicate rows.
Sort dataframe by row.
Remove unique rows by keeping only duplicates.
Remove duplicate rows again.
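An equivalent way to do the row-sorting step, shown here only as a sketch and not as the answer's original code, is np.sort along axis 1, which avoids the apply call:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1' : ['A','C','D','B','D','A','E'],
                   'Col2' : ['B','D','C','A','C','B','F']})

# Drop exact duplicates, then sort each pair alphabetically.
deduped = df.drop_duplicates()
sorted_pairs = pd.DataFrame(np.sort(deduped.values, axis=1),
                            columns=['Col1', 'Col2'], index=deduped.index)

# Keep only pairs that occur more than once, then drop the repeats.
print(sorted_pairs[sorted_pairs.duplicated(keep=False)].drop_duplicates())
#   Col1 Col2
# 0    A    B
# 1    C    D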
I have a dataframe containing strings and NaNs. I want to str.lower() certain columns by name to_lower = ['b', 'd', 'e']. Ideally I could do it with a method on the whole dataframe, rather than with a method on df[to_lower]. I have
df[to_lower] = df[to_lower].apply(lambda x: x.astype(str).str.lower())
but I would like a way to do it without assigning to the selected columns.
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b']})
to_lower = ['a']
df2 = df.copy()
df2[to_lower] = df2[to_lower].apply(lambda x: x.astype(str).str.lower())
You can use the assign method and unpack the result as keyword arguments:
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b'], 'c': ['C', 'c']})
to_lower = ['a', 'b']
df.assign(**df[to_lower].apply(lambda x: x.astype(str).str.lower()))
# a b c
#0 a b C
#1 a b c
You want this:
for column in to_lower:
    df[column] = df[column].str.lower()
This is far more efficient assuming you have more rows than columns.
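One caveat, since the original frame contains NaNs: .str.lower() leaves NaN as NaN, whereas the astype(str) version turns them into the literal string 'nan'. A minimal sketch (the NaN placement is made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'b': ['B', np.nan], 'd': ['D', 'd'], 'e': ['E', 'e']})
to_lower = ['b', 'd', 'e']

for column in to_lower:
    df[column] = df[column].str.lower()

print(df)
#      b  d  e
# 0    b  d  e
# 1  NaN  d  e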