I'm trying to compare cells within a data frame using pandas.
The data looks like this:
seqnames, start, end, width, strand, s1, s2, s3, sn
1, Ha412HOChr01, 1, 220000, 220000, CN2, CN10, CN2, CN2
2, Ha412HOChr01, 1, 220000, 220000, CN2, CN2, CN2, CN2
3, Ha412HOChr01, 1, 220000, 220000, CN2, CN4, CN2, CN2
n, Ha412HOChr01, 1, 220000, 220000, CN2, CN2, CN2, CN6
I was able to make individual comparisons with the following code:
import pandas as pd

df = pd.read_csv("test.csv")

if df.iloc[0, 5] != df.iloc[0, 6]:
    print("yay!")
else:
    print("not interesting...")
I would like to compare s1 against all the other s columns, row by row, in a loop or by any other more efficient method.
When I tried the following code:
df = pd.read_csv("test.csv")
df.columns
#make sure to change in future analysis
ref = df[' Sunflower_14_S8']
all_the_rest = df.drop(['seqnames', ' start', ' end', ' width', ' strand'], axis=1)
#all_the_rest.columns
OP = ref.eq(all_the_rest)
OP.to_csv("OP.csv")
I got a weird output:
0,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False
444,False,False,False,False,False,False,False,False,False,False,False,False,False
It seems like it compared all the characters instead of the strings.
I'm new to programming and I'm stuck; I'd appreciate your help!
Does this help?
import pandas as pd
# define a list of columns you want to compare
cols = ['s1', 's2', 's3']
# some sample data
df = pd.DataFrame(columns=cols)
df['s1'] = ['CN2', 'CN10', 'CN2', 'CN2']
df['s2'] = ['CN2', 'CN2', 'CN2', 'CN2']
df['s3'] = ['CN2', 'CN2', 'CN2', 'CN6']
# remove 's1' from the list of columns
cols_except_s1 = [x for x in cols if x!='s1']
# create a blank dataframe to hold our comparisons
df_comparison = pd.DataFrame(columns=cols_except_s1)
# iterate through each other column, comparing it against 's1'
for x in cols_except_s1:
    comparison_series = df['s1'] == df[x]
    df_comparison[x] = comparison_series
# the result is a dataframe that has columns of Boolean values
print(df_comparison)
outputs
s2 s3
0 True True
1 False False
2 True True
3 True False
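As a side note, the loop above can be collapsed into one vectorised call: `DataFrame.eq` with `axis=0` compares every column against the `s1` Series row by row. A minimal sketch with the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    's1': ['CN2', 'CN10', 'CN2', 'CN2'],
    's2': ['CN2', 'CN2', 'CN2', 'CN2'],
    's3': ['CN2', 'CN2', 'CN2', 'CN6'],
})

# compare every other column against s1 in a single call;
# axis=0 aligns on the row index instead of on column labels
df_comparison = df[['s2', 's3']].eq(df['s1'], axis=0)
print(df_comparison)
```

This produces the same Boolean dataframe as the loop, without building it column by column.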
Well, 9 hours later I found a way without using pandas...
df = pd.read_csv("test.csv")
#df.columns

# convert the data frame to a list of rows
# ("rows" instead of "list", which would shadow the builtin)
rows = df.values.tolist()

for line in rows:
    lineAVG = sum(line[5:]) / len(line[5:])
    ref = line[5]
    if lineAVG - ref > 0.15:
        output = line
        print(output)
Here is the code to filter a dataframe based on field and wellname using two dropdowns. The filter is applied to a pandas dataframe, and I want the filtered output (common_filter) to also be a pandas dataframe. Currently, the output is of widgets type. Is there any way of getting it as a dataframe?
The code below is taken from TowardsDataScience and modified a bit.
The "unique_sorted_values" function simply returns a list of the unique, sorted values of the passed array, in this case FieldID and WellnameID.
import ipywidgets as widgets
import pandas as pd

# dummy data
df = pd.DataFrame({'FieldID': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'WellnameID': ['1_A', '1_A', '2_A', '1_B', '1_B', '2_B', '2_B'],
                   'value': [1, 2, 3, 4, 5, 6, 7]})

output = widgets.Output()

dropdown_field = widgets.Dropdown(options=unique_sorted_values(df.FieldID))
dropdown_wellname = widgets.Dropdown(options=unique_sorted_values(df[df.FieldID == dropdown_field.value].WellnameID))

def common_filtering(field, wellname):
    output.clear_output()
    common_filter = df[(df.FieldID == field) & (df.WellnameID == wellname)]
    with output:
        display(common_filter)

def dropdown_field_eventhandler(change):
    common_filtering(change.new, dropdown_wellname.value)

def dropdown_wellname_eventhandler(change):
    common_filtering(dropdown_field.value, change.new)

dropdown_field.observe(dropdown_field_eventhandler, names='value')
dropdown_wellname.observe(dropdown_wellname_eventhandler, names='value')

input_widgets = widgets.HBox([dropdown_field, dropdown_wellname])
display(input_widgets)
display(output)
You cannot use the return value of the function for the dataframe, as the return value is never assigned to anything in the main body of the code (the function is passed to the observers as a callback). Since you want to create a whole new dataframe (rather than modify an existing one), a simple way would be to use the global keyword on a copied version of the initial data.
After choosing the dropdowns, you should be able to get the filtered dataframe in a cell below and see the impact of the filters. If you need anything more complex, you probably want to construct a class object to track the state of data, apply filters etc.
import ipywidgets as widgets
import pandas as pd

# dummy data
df = pd.DataFrame({'FieldID': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'WellnameID': ['1_A', '1_A', '2_A', '1_B', '1_B', '2_B', '2_B'],
                   'value': [1, 2, 3, 4, 5, 6, 7]})

filtered = pd.DataFrame()
output = widgets.Output()

dropdown_field = widgets.Dropdown(options=sorted(df.FieldID.unique()))
dropdown_wellname = widgets.Dropdown(options=sorted(df[df.FieldID == dropdown_field.value].WellnameID.unique()))

def common_filtering(field, wellname):
    global filtered
    output.clear_output()
    filtered = df[(df.FieldID == field) & (df.WellnameID == wellname)]
    with output:
        display(filtered)

def dropdown_field_eventhandler(change):
    common_filtering(change.new, dropdown_wellname.value)

def dropdown_wellname_eventhandler(change):
    common_filtering(dropdown_field.value, change.new)

dropdown_field.observe(dropdown_field_eventhandler, names='value')
dropdown_wellname.observe(dropdown_wellname_eventhandler, names='value')

input_widgets = widgets.HBox([dropdown_field, dropdown_wellname])
display(input_widgets)
display(output)
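If the global feels fragile, the class-based approach mentioned above can be quite small: an object owns the source dataframe and the current filtered view, and the widget handlers just call its method. A sketch using pandas only, with the widget wiring left out:

```python
import pandas as pd

class FilterState:
    """Owns the source dataframe and the currently filtered view."""

    def __init__(self, df):
        self.df = df
        self.filtered = df  # start unfiltered

    def apply(self, field, wellname):
        # recompute the filtered view and keep it on the instance
        self.filtered = self.df[(self.df.FieldID == field) &
                                (self.df.WellnameID == wellname)]
        return self.filtered

df = pd.DataFrame({'FieldID': ['A', 'A', 'B'],
                   'WellnameID': ['1_A', '2_A', '1_B'],
                   'value': [1, 2, 3]})

state = FilterState(df)
state.apply('A', '2_A')
print(state.filtered)
```

The event handlers would then call state.apply(...) instead of mutating a global, and any later cell can read state.filtered.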
I am trying to duplicate a dictionary multiple times based on the Subsamples value in dictionary test1.
test1={'Subsamples':3}
test2={'Substrate':0,'Incubation Time':0}
test3={'Colonies':0,'Color':0,'Size':0}
if test1['Subsamples'] > 0:
    for x in range(0, test1['Subsamples']):
        # Magic happens here
print(test1)
>>>{'Subsamples':3}
print (test2)
>>>{'Substrate1':0,'Incubation Time1':0,'Substrate2':0,'Incubation Time2':0,'Substrate3':0,'Incubation Time3':0}
print(test3)
>>>{'Colonies1':0,'Color1':0,'Size1':0,'Colonies2':0,'Color2':0,'Size2':0,'Colonies3':0,'Color3':0,'Size3':0}
So in the example above the value for key Subsamples is three, so the dictionary is "copied" 3 times with the number added to the end of each key each iteration.
Seems like what you want to do is:
def dict_mult(d, n):
    assert n >= 0
    ret = {}
    for i in range(n):
        for k, v in d.items():
            ret['%s%s' % (k, i + 1)] = v
    return ret
However, that looks like a strange thing to do... The dict you produce looks hard to handle in the rest of your program (you would have to rebuild the keys with string concatenation).
Are you sure you should not rather produce something like
{('Colonies',1):0,('Color',1):0,('Size',1):0,('Colonies',2):0,('Color',2):0,('Size',2):0,('Colonies',3):0,('Color',3):0,('Size',3):0}
or even better
[
{'Colonies':0,'Color':0,'Size':0},
{'Colonies':0,'Color':0,'Size':0},
{'Colonies':0,'Color':0,'Size':0},
]
?
The latter is easily produced by
[dict(test3) for _ in range(n)]
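For instance, with n = 3 this builds three independent copies, so mutating one subsample's dict leaves the others untouched:

```python
test3 = {'Colonies': 0, 'Color': 0, 'Size': 0}
n = 3

# dict(test3) makes a fresh copy for each subsample
subsamples = [dict(test3) for _ in range(n)]
subsamples[0]['Colonies'] = 5

print(subsamples)
```

Each entry can then be addressed by index instead of by a string-concatenated key.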
You can use my_dict.items() to iterate over keys and values at the same time.
Here is one way to do it:
test1 = {'Subsamples': 3}
test2 = {'Substrate': 0, 'Incubation Time': 0}
test3 = {'Colonies': 0, 'Color': 0, 'Size': 0}
def create_new_dict(dict_number, dict_to_apply):
    new_dict = {}
    size = dict_number['Subsamples']
    if size > 0:
        for x in range(0, size):
            for key, value in dict_to_apply.items():
                new_dict[key + str(x + 1)] = value
    return new_dict
print(create_new_dict(test1, test2))
# {'Substrate1': 0, 'Incubation Time1': 0, 'Substrate2': 0, 'Incubation Time2': 0, 'Substrate3': 0, 'Incubation Time3': 0}
print(create_new_dict(test1, test3))
# {'Colonies1': 0, 'Color1': 0, 'Size1': 0, 'Colonies2': 0, 'Color2': 0, 'Size2': 0, 'Colonies3': 0, 'Color3': 0, 'Size3': 0}
I am trying to write a function that turns all the non-numerical columns in a data set to numerical form.
The data set is a list of lists.
Here is my code:
def handle_non_numerical_data(data):
    def convert_to_numbers(data, index):
        items = []
        column = [line[0] for line in data]
        for item in column:
            if item not in items:
                items.append(item)
        [line[0] = items.index(line[0]) for line in data]
        return new_data
    for value in data[0]:
        if isinstance(value, str):
            convert_to_numbers(data, data[0].index(value))
Apparently [line[0] = items.index(line[0]) for line in data] is not valid syntax, and I can't figure out how to modify the first column of data while iterating over it.
I can't use numpy because the data will not be in numerical form until after this function is run.
How do I do this and why is it so complicated? I feel like this should be way simpler than it is...
In other words, I want to turn this:
[[M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15],
[M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7],
[F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9]]
into this:
[[0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15],
[0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7],
[1,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9]]
Note that the first column was changed from strings to numbers.
Solution
data = [['M',0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15],
['M',0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7],
['F',0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9]]
values = {'M': 0, 'F': 1}
new_data = [[values.get(val, val) for val in line] for line in data]
new_data
Output:
[[0, 0.455, 0.365, 0.095, 0.514, 0.2245, 0.101, 0.15, 15],
[0, 0.35, 0.265, 0.09, 0.2255, 0.0995, 0.0485, 0.07, 7],
[1, 0.53, 0.42, 0.135, 0.677, 0.2565, 0.1415, 0.21, 9]]
Explanation
You can take advantage of Python dictionaries and their get method.
These are values for the strings:
values = {'M': 0, 'F': 1}
You can also add more strings like I with a corresponding value.
If the string is in values, you will get the value from the dict:
>>> values.get('M', 'M')
0
Otherwise, you will get the original value:
>>> values.get(10, 10)
10
Rather than indexing (which I'm not sure how it was supposed to work in your example), you can instead create a dictionary mapping for letters to numbers. Something like this should work.
raw_data = [['M',0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15],
['M',0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7],
['F',0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9]]
def handle_non_numerical_data(data):
    mapping = {'M': 0, 'F': 1, 'I': 2}
    for item in data:
        if isinstance(item[0], str):
            item[0] = mapping.get(item[0], -1)  # Returns -1 if letter not found
    return data

run = handle_non_numerical_data(raw_data)
print(run)
This answer will use a dict to store the coding from str to int. It can be preloaded and also investigated after the data has been replaced.
# MODIFIES DATA IN PLACE
data = [['M',0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15],
['M',0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7],
['F',0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9]]
coding_dict = {}  # can also preload this: {'M': 0, 'F': 1}
for row in data:
    if row[0] not in coding_dict:
        coding_dict[row[0]] = len(coding_dict)
    row[0] = coding_dict[row[0]]
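A handy side effect is that coding_dict can be inverted afterwards to recover the original labels. A quick check on a trimmed-down version of the rows above:

```python
data = [['M', 0.455], ['M', 0.35], ['F', 0.53]]

coding_dict = {}
for row in data:
    if row[0] not in coding_dict:
        coding_dict[row[0]] = len(coding_dict)
    row[0] = coding_dict[row[0]]

# invert the mapping to decode the integers back to labels
decoding = {v: k for k, v in coding_dict.items()}
print(coding_dict)                          # codes in order of first appearance
print([decoding[row[0]] for row in data])   # original labels recovered
```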
I have a questionnaire dataset in which one of the columns (a question) has multiple possible answers. The data for that column is a string representation of a list, with anywhere from none up to five values, i.e. '[1]' or '[1, 2, 3, 5]'.
I am trying to process that column to access the values independently as follows:
import re

def f(x):
    if pd.notnull(x):
        p = re.compile(r"[\[\]'\s]")
        places = p.sub('', x).split(',')
        place_tally = {'1': 0, '2': 0, '3': 0, '4': 0, '5': 0}
        for place in places:
            place_tally[place] += 1
        return place_tally

df['places'] = df.where_buy.map(f)
This creates a new column in my dataframe "places" with a dict from the values i.e: {'1': 1, '3': 0, '2': 0, '5': 0, '4': 0} or {'1': 1, '3': 1, '2': 1, '5': 1, '4': 0}
Now what is the most efficient/succinct way to extract that data from the new column? I've tried iterating through the DataFrame with no good results, e.g.:
for row_index, row in df.iterrows():
    r = row['places']
    if r is not None:
        df.ix[row_index]['large_super'] = r['1']
        df.ix[row_index]['small_super'] = r['2']
This does not seem to be working.
Thanks.
Is this what you are intending to do?
for i in range(1, 6):
    df['super_' + str(i)] = df['place'].map(lambda x: x.count(str(i)))
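Another option, assuming the 'places' column already holds the tally dicts produced by f, is to expand them straight into columns with apply(pd.Series) and then pick off the keys you need:

```python
import pandas as pd

# two sample tallies, shaped like the dicts f() returns in the question
df = pd.DataFrame({'places': [
    {'1': 1, '2': 0, '3': 0, '4': 0, '5': 0},
    {'1': 1, '2': 1, '3': 1, '4': 0, '5': 1},
]})

# each dict becomes a row of columns named after its keys
expanded = df['places'].apply(pd.Series)
df['large_super'] = expanded['1']
df['small_super'] = expanded['2']
print(df[['large_super', 'small_super']])
```

This avoids iterrows and the chained-assignment problem of writing through df.ix[row_index][...], which modifies a copy rather than the original frame.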