Processing multiple values string in Pandas DataFrame column - python

I have a questionnaire dataset in which one of the columns (a question) has multiple possible answers. The data for that column is a string representation of a list, with anywhere from zero to five values, e.g. '[1]' or '[1, 2, 3, 5]'
I am trying to process that column to access the values independently as follows:
import re
from pandas import notnull

def f(x):
    if notnull(x):
        # strip brackets, quotes and whitespace, then split on commas
        p = re.compile(r'[\[\]\'\s]')
        places = p.sub('', x).split(',')
        place_tally = {'1': 0, '2': 0, '3': 0, '4': 0, '5': 0}
        for place in places:
            place_tally[place] += 1
        return place_tally

df['places'] = df.where_buy.map(f)
This creates a new column "places" in my DataFrame holding a dict of the values, e.g. {'1': 1, '3': 0, '2': 0, '5': 0, '4': 0} or {'1': 1, '3': 1, '2': 1, '5': 1, '4': 0}
Now what is the most efficient/succinct way to extract that data from the new column? I've tried iterating through the DataFrame with no good results, e.g.
for row_index, row in df.iterrows():
    r = row['places']
    if r is not None:
        df.ix[row_index]['large_super'] = r['1']
        df.ix[row_index]['small_super'] = r['2']
This does not seem to be working.
Thanks.

Is this what you are intending to do?
for i in range(1, 6):
    # count how often option i appears in the raw answer string
    df['super_' + str(i)] = df['where_buy'].map(lambda x: x.count(str(i)))
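If the strings are valid Python list literals, a parser may be more robust than regex stripping or substring counting. Here is a minimal sketch using ast.literal_eval from the standard library. The names OPTIONS and tally are made up for illustration, and the option codes 1 to 5 are an assumption taken from the question.

```python
import ast

OPTIONS = (1, 2, 3, 4, 5)  # assumed answer codes, taken from the question


def tally(raw):
    """Return {option: count} for a string such as '[1, 2, 3, 5]'."""
    counts = {option: 0 for option in OPTIONS}
    if not isinstance(raw, str):  # missing answers (None/NaN) stay all zero
        return counts
    for value in ast.literal_eval(raw):
        counts[int(value)] += 1  # int() also accepts quoted values like '1'
    return counts


rows = ['[1]', '[1, 2, 3, 5]', None, '[]']
tallies = [tally(raw) for raw in rows]
print(tallies[1])  # {1: 1, 2: 1, 3: 1, 4: 0, 5: 1}
```

With pandas, something like pd.DataFrame(df['where_buy'].map(tally).tolist(), index=df.index) should then give one column per option, which avoids iterating over rows.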

Related

Empty Dictionary while trying to count number of different characters in Input String using a dictionary

I get an empty dictionary when I try to count the number of different characters (upper case and lower case) in a given string.
Here is the code I tried. In the if branch I set a = 1 as a placeholder that does nothing.
input_str = "AAaabbBBCC"
histogram = dict()
for idx in range(len(input_str)):
    val = input_str[idx]
    # print(val)
    if val not in histogram:
        # do nothing
        a = 1
    else:
        histogram[val] = 1
print(histogram)
#print("number of different are :",len(histogram))
Here is my code's output:
{}
I am expecting output like this:
{ 'A': 1,
'a': 1,
'b': 1,
'B': 1,
'C': 1
}
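In your code the assignment sits in the else branch, which never runs because the dictionary starts out empty, so nothing is ever added. If you want to count the distinct values in your string, you could do it this way: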
input_str = "AAaabbBBCC"
histogram = dict()
for idx in range(len(input_str)):
    val = input_str[idx]
    if val not in histogram:
        #add to dictionary
        histogram[val] = 1
    else:
        #increase count
        histogram[val] += 1
>>> histogram
{'A': 2, 'a': 2, 'b': 2, 'B': 2, 'C': 2}
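For comparison, the same tally can be sketched with collections.Counter from the standard library, which does the "add or increase" bookkeeping itself. The variable name distinct is only for illustration.

```python
from collections import Counter

input_str = "AAaabbBBCC"

# Counter builds the character -> occurrences mapping in one call
histogram = Counter(input_str)

# the number of different characters is the number of keys
distinct = len(histogram)

print(dict(histogram))  # {'A': 2, 'a': 2, 'b': 2, 'B': 2, 'C': 2}
print(distinct)         # 5
```

If only the number of different characters is needed, len(set(input_str)) gives the same 5 without counting occurrences.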

How to loop through a dictionary of dictionaries and make a 2d array?

So, I have a dictionary like this:
dic_parsed_sentences = {'religion': {'david': 1, 'joslin': 1, 'apolog': 5, 'jim': 1, 'meritt': 2},
'sport': {'sari': 1, 'basebal': 1, 'kolang': 5, 'footbal': 1, 'baba': 2},
'education': {'madrese': 1, 'kelas': 1, 'yahyah': 5, 'dars': 1},
'computer': {'net': 1, 'internet': 1},
'windows': {'copy': 1, 'right': 1}}
I want to loop through it based on the length of the dictionaries within that dictionary.
For example,
it has two items with length 5, one item with length 4, and two items with length 2. I want to process the same length items together (something like a group by in pandas).
So the output of the first iteration will look like this (as you see only items with length 5 are available here):
[[david, joslin, apolog, jim, meritt],
[sari, basebal, kolang, footbal, baba]]
and next iteration it will make the next same length items:
[[madrese, kelas, yahyah, dars]]
And the last iteration:
[[net, internet],
[copy, right]]
Why do we only have three iterations here? Because we only have three different lengths of items within the dictionary dic_parsed_sentences.
I have done something like this, but I don't know how to iterate through the same-length items:
for i in dic_parsed_sentences.groupby(dic_parsed_sentences.same_length_items):  # pseudocode: I don't know how to iterate over the same-length items in the dicts
    for index_file in dic_parsed_sentences:
        temp_sentence = dic_parsed_sentences[index_file]
        keys_words = list(temp_sentence.keys())
        for index_word in range(len(keys_words)):
            arr_sent_wids[index_sentence, index_word] = keys_words[index_word]
            index = index + 1
        index_sentence = index_sentence + 1
Update:
for length, dics in itertools.groupby(dic_parsed_sentences, len):
    for index_file in dics:
        temp_sentence = dics[index_file]
        keys_words = list(temp_sentence.keys())
        for index_word in range(len(keys_words)):
            test_sent_wids[index_sentence, index_word] = lookup_word2id(keys_words[index_word])
            index = index + 1
        index_sentence = index_sentence + 1
You can use itertools.groupby after sorting the dictionary elements by length.
import itertools

items = sorted(dic_parsed_sentences.values(), key=len, reverse=True)
for length, dics in itertools.groupby(items, len):
    # dics is all the nested dictionaries with this length
    for temp_sentence in dics:
        keys_words = list(temp_sentence.keys())
        for index_word in range(len(keys_words)):
            test_sent_wids[index_sentence, index_word] = lookup_word2id(keys_words[index_word])
            index = index + 1
        index_sentence = index_sentence + 1
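The snippet above relies on test_sent_wids and lookup_word2id from the question, so it cannot be run on its own. Below is a self-contained sketch of just the sort-then-groupby step, producing the batches of key lists described in the question. The function name batches_by_length is made up for illustration.

```python
from itertools import groupby

dic_parsed_sentences = {
    'religion': {'david': 1, 'joslin': 1, 'apolog': 5, 'jim': 1, 'meritt': 2},
    'sport': {'sari': 1, 'basebal': 1, 'kolang': 5, 'footbal': 1, 'baba': 2},
    'education': {'madrese': 1, 'kelas': 1, 'yahyah': 5, 'dars': 1},
    'computer': {'net': 1, 'internet': 1},
    'windows': {'copy': 1, 'right': 1},
}


def batches_by_length(nested):
    """Group the inner dicts' key lists by length, longest first."""
    # groupby only merges adjacent items, so sort by the same key first
    items = sorted(nested.values(), key=len, reverse=True)
    return [[list(d) for d in group] for _, group in groupby(items, key=len)]


batches = batches_by_length(dic_parsed_sentences)
for batch in batches:
    print(batch)
```

Each batch is a list of equal-length rows, so it could be turned into a 2D array directly if that is what the later processing needs.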
Alternatively, collect the key lists in a dictionary keyed by length:
bylen = {}
for v in dic_parsed_sentences.values():
    l = len(v)
    if l not in bylen:
        bylen[l] = []
    bylen[l].append(list(v.keys()))
for k in sorted(bylen, reverse=True):
    print(bylen[k])  # use bylen[k]
You can do it using the following method:
finds = [[key, len(dic_parsed_sentences[key])] for key in dic_parsed_sentences]
finds.sort(reverse=True, key=lambda x: x[1])
previous = finds[0][1]
res = []
for elem in finds:
    current = elem[1]
    if current != previous:
        # length changed: emit the finished group and start a new one
        previous = current
        print(res)
        res = []
    res.append(list(dic_parsed_sentences[elem[0]]))
print(res)

Get the count of words from a pandas DataFrame where each column holds a list of words

I have a pandas DataFrame, say:
1. oshin oshin1 oshin2
2. oshin3 oshin2 oshin4
I want to get a counter so that my output is:
oshin:1
oshin1:1
oshin2:2
oshin3:1
oshin4:1
I want to export the output to a CSV file, as it is going to be really long.
How do I do this in pandas, ideally for any column?
I think you first need to create lists in each column with apply and split, then convert to a NumPy array with values and flatten it with numpy.ravel. Convert that to a list, flatten the nested lists with numpy.concatenate, apply Counter, and finally convert to a dict:
print (df)
col
0 oshin oshin1 oshin2
1 oshin3 oshin2 oshin4
import numpy as np
from collections import Counter

cols = ['col', ...]
d = dict(Counter(np.concatenate(df[cols].apply(lambda x : x.str.split()) \
                                        .values.ravel().tolist())))
print (d)
{'oshin3': 1, 'oshin4': 1, 'oshin1': 1, 'oshin': 1, 'oshin2': 2}
If there is only one column (thanks Jon Clements):
d = dict(df['col'].str.split().map(Counter).sum())
print (d)
{'oshin3': 1, 'oshin4': 1, 'oshin1': 1, 'oshin': 1, 'oshin2': 2}
EDIT:
Another faster solution from John Galt, thank you:
d = pd.Series(' '.join(df['col']).split()).value_counts().to_dict()
print (d)
{'oshin3': 1, 'oshin4': 1, 'oshin1': 1, 'oshin': 1, 'oshin2': 2}
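None of the snippets above covers the CSV export the question asks about. Here is a rough sketch of that step using only the standard library, assuming the rows are available as plain strings (for example via df['col'].tolist()). The in-memory buffer and the header names word and count are placeholders for illustration.

```python
import csv
import io
from collections import Counter

rows = ["oshin oshin1 oshin2", "oshin3 oshin2 oshin4"]

# count every whitespace-separated word across all rows
counts = Counter(word for row in rows for word in row.split())

# StringIO stands in for a real file, e.g. open("counts.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["word", "count"])        # header names are an assumption
writer.writerows(sorted(counts.items()))  # one "word,count" line per word

print(buffer.getvalue())
```

Staying inside pandas, calling to_csv on the Series returned by value_counts() in the last solution should achieve the same thing without building the dict first.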

Ordering data by index after importing dictionary of dictionaries into DataFrame

I am trying to reorder a DataFrame by index. This DataFrame was created from a dictionary of dictionaries. I am trying to use DataFrame.sort_values. However, the sorting makes absolutely no difference when I print the DataFrame.
The following code exemplifies what I am trying to achieve:
import pandas as pd
# Metrics dictionary keys
GOLD_CNT_KEY = 'Gold_Cnt'
PRED_CNT_KEY = 'Pred_Cnt'
NER_INTERSEC_CNT_KEY = 'NER_Intersec_Cnt'
NER_PREC_KEY = 'NER_Precision'
NER_REC_KEY = 'NER_Recall'
NER_F1_KEY = 'NER_F1'
NERC_INTERSEC_CNT_KEY = 'NERC_Intersec_Cnt'
NERC_PREC_KEY = 'NERC_Precision'
NERC_REC_KEY = 'NERC_Recall'
NERC_F1_KEY = 'NERC_F1'
tag_classes = ['X', 'Y', 'Z', 'V']
def get_empty_stats_dict():
    """Dictionary for the counts and metrics"""
    return {GOLD_CNT_KEY: 0,
            PRED_CNT_KEY: 0,
            NER_INTERSEC_CNT_KEY: 0,
            NER_PREC_KEY: 0,
            NER_REC_KEY: 0,
            NER_F1_KEY: 0,
            NERC_INTERSEC_CNT_KEY: 0,
            NERC_PREC_KEY: 0,
            NERC_REC_KEY: 0,
            NERC_F1_KEY: 0}

stats = {}
for tag_class in tag_classes:
    stats.update({tag_class: get_empty_stats_dict()})
# I want to order by these indexes
index_order = [GOLD_CNT_KEY, PRED_CNT_KEY, NER_INTERSEC_CNT_KEY, NERC_INTERSEC_CNT_KEY,
               NER_PREC_KEY, NERC_PREC_KEY, NER_REC_KEY, NERC_REC_KEY, NER_F1_KEY, NERC_F1_KEY]
_stats = pd.DataFrame(stats)
# These two prints yield exactly the same output. Why doesn't sort_values make any difference?
print(_stats)
print(_stats.sort_values(index_order, axis=1))
Try: _stats = _stats.reindex(index_order)
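Some background on why this helps: sort_values orders by the data stored under the labels you pass, and since every value in the question's frame is 0 there is nothing to reorder. reindex, by contrast, arranges the rows to follow a given list of labels. A small sketch with made-up labels and values (frame, wanted and ordered are illustrative names):

```python
import pandas as pd

# dict of dicts: outer keys become columns, inner keys become row labels
stats = {
    'X': {'b': 0, 'a': 1, 'c': 2},
    'Y': {'b': 3, 'a': 4, 'c': 5},
}
frame = pd.DataFrame(stats)

wanted = ['a', 'b', 'c']         # the row order we want
ordered = frame.reindex(wanted)  # rows now follow the list, values travel with them

print(ordered)
```

Since every label in the list exists in the frame, frame.loc[wanted] should give the same result; reindex additionally tolerates labels that are missing and fills them with NaN.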

Modifying column of 2d list while iterating over it in python

I am trying to write a function that turns all the non-numerical columns in a data set to numerical form.
The data set is a list of lists.
Here is my code:
def handle_non_numerical_data(data):
    def convert_to_numbers(data, index):
        items = []
        column = [line[0] for line in data]
        for item in column:
            if item not in items:
                items.append(item)
        [line[0] = items.index(line[0]) for line in data]
        return new_data
    for value in data[0]:
        if isinstance(value, str):
            convert_to_numbers(data, data[0].index(value))
Apparently [line[0] = items.index(line[0]) for line in data] is not valid syntax, and I can't figure out how to modify the first column of data while iterating over it.
I can't use numpy because the data will not be in numerical form until after this function is run.
How do I do this and why is it so complicated? I feel like this should be way simpler than it is...
In other words, I want to turn this:
[[M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15],
[M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7],
[F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9]]
into this:
[[0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15],
[0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7],
[1,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9]]
Note that the first column was changed from strings to numbers.
Solution
data = [['M',0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15],
['M',0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7],
['F',0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9]]
values = {'M': 0, 'F': 1}
new_data = [[values.get(val, val) for val in line] for line in data]
new_data
Output:
[[0, 0.455, 0.365, 0.095, 0.514, 0.2245, 0.101, 0.15, 15],
[0, 0.35, 0.265, 0.09, 0.2255, 0.0995, 0.0485, 0.07, 7],
[1, 0.53, 0.42, 0.135, 0.677, 0.2565, 0.1415, 0.21, 9]]
Explanation
You can take advantage of Python dictionaries and their get method.
These are values for the strings:
values = {'M': 0, 'F': 1}
You can also add more strings, such as 'I', with a corresponding value.
If the string is in values, you will get the value from the dict:
>>> values.get('M', 'M')
0
Otherwise, you will get the original value:
>>> values.get(10, 10)
10
Rather than indexing (which I'm not sure how it was supposed to work in your example), you can instead create a dictionary mapping for letters to numbers. Something like this should work.
raw_data = [['M',0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15],
['M',0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7],
['F',0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9]]
def handle_non_numerical_data(data):
    mapping = {'M': 0, 'F': 1, 'I': 2}
    for item in data:  # iterate over the argument, not the global raw_data
        if isinstance(item[0], str):
            item[0] = mapping.get(item[0], -1)  # Returns -1 if letter not found
    return data

run = handle_non_numerical_data(raw_data)
print(run)
This answer will use a dict to store the coding from str to int. It can be preloaded and also investigated after the data has been replaced.
# MODIFIES DATA IN PLACE
data = [['M',0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15],
['M',0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7],
['F',0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9]]
coding_dict = {} # can also preload this {'M': 0, 'F':1}
for row in data:
    if row[0] not in coding_dict:
        coding_dict[row[0]] = len(coding_dict)
    row[0] = coding_dict[row[0]]
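The question asks about all non-numerical columns, not just the first one. The same idea can be extended with one coding dict per column. This is only a sketch: the function name encode_string_columns is made up, and the third column in the sample data is hypothetical, added to show a second string column.

```python
def encode_string_columns(data):
    """Replace strings with integer codes, column by column, in place.

    Returns {column index: {string: code}} so the coding can be inspected.
    """
    codings = {}
    for row in data:
        for index, value in enumerate(row):
            if isinstance(value, str):
                column_coding = codings.setdefault(index, {})
                # first time a string is seen it gets the next free code
                row[index] = column_coding.setdefault(value, len(column_coding))
    return codings


# the third column is hypothetical, added only for illustration
data = [['M', 0.455, 'x'],
        ['M', 0.35, 'y'],
        ['F', 0.53, 'x']]
codings = encode_string_columns(data)
print(data)     # [[0, 0.455, 0], [0, 0.35, 1], [1, 0.53, 0]]
print(codings)  # {0: {'M': 0, 'F': 1}, 2: {'x': 0, 'y': 1}}
```

Codes are assigned in order of first appearance within each column, so the returned mapping is needed to translate the numbers back to the original strings.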
