With the following program, I am trying to count the occurrences of 0, 1, 2, and 3 in each column of a matrix. The program is not working as desired. I read somewhere that the matrix should be sliced to compute the occurrences column-wise, but I am not sure how to do that. The program is written in Python using NumPy. How can I do this with NumPy?
import numpy as np
a=np.array([[ 2,1,1,2,1,1,2], #t1 is horizontal
[1,1,2,2,1,1,1],
[2,1,1,1,1,2,1],
[3,3,3,2,3,3,3],
[3,3,2,3,3,3,2],
[3,3,3,2,2,2,3],
[3,2,2,1,1,1,0]])
print(a)
i=0
j=0
two=0
zero=0
one=0
three=0
r=a.shape[0]
c=a.shape[1]
for i in range(1,r):
#print(repr(a))
for j in range(1,c):
#sele=a[i,j]
if (a[i,j]==0):
zero+=1
if (a[i,j]==1):
one+=1
if (a[i,j]==2):
two+=1
if (a[i,j]==3):
three+=1
if i==c-1:
#print(zero)
print(one)
i+=0
j=j+1
#print(two)
#print(three)
i=i+1
#print(zero)
Also, I want to print it in the following manner:
column:          0 1 2 3 4 5 6
occurrences: 0   0 0 0 0 0 0 1
             1   1 3 2 2 4 3 1
             2   2 1 3 4 1 2 2
             3   4 3 2 1 2 2 2
Here is the code using list functionality
import numpy as np
inputArr=np.array([[ 2,1,1,2,1,1,2],
[1,1,2,2,1,1,1],
[2,1,1,1,1,2,1],
[3,3,3,2,3,3,3],
[3,3,2,3,3,3,2],
[3,3,3,2,2,2,3],
[3,2,2,1,1,1,0]
])
occurrence = {}
toFindList = [0, 1, 2, 3]
# Iterate over the columns (shape[1]); len(inputArr) would give the number of rows
for col in range(inputArr.shape[1]):
    collist = inputArr[:, col].tolist()
    occurrence['col_' + str(col)] = {}
    for num in toFindList:
        occurrence['col_' + str(col)][str(num)] = collist.count(num)
for key, value in occurrence.items():
    print(key, value)
Output:
col_0 {'0': 0, '1': 1, '2': 2, '3': 4}
col_1 {'0': 0, '1': 3, '2': 1, '3': 3}
col_2 {'0': 0, '1': 2, '2': 3, '3': 2}
col_3 {'0': 0, '1': 2, '2': 4, '3': 1}
col_4 {'0': 0, '1': 4, '2': 1, '3': 2}
col_5 {'0': 0, '1': 3, '2': 2, '3': 2}
col_6 {'0': 1, '1': 2, '2': 2, '3': 2}
This should give you the output format you want (one row per value, one column per matrix column):
def col_unique(a):
    # np.isin keeps the shape of a, so no reshape is needed
    return np.sum(np.dstack([np.isin(a, i) for i in np.unique(a)]), axis=0).T
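For comparison, here is a minimal sketch of the same column-wise count done with broadcasting instead of stacking. It assumes the values to count are exactly 0 to 3; the names values and counts are only illustrative and do not come from the question.

```python
import numpy as np

a = np.array([[2, 1, 1, 2, 1, 1, 2],
              [1, 1, 2, 2, 1, 1, 1],
              [2, 1, 1, 1, 1, 2, 1],
              [3, 3, 3, 2, 3, 3, 3],
              [3, 3, 2, 3, 3, 3, 2],
              [3, 3, 3, 2, 2, 2, 3],
              [3, 2, 2, 1, 1, 1, 0]])

values = np.arange(4)  # assumption: the values to count are 0, 1, 2, 3

# values[:, None, None] has shape (4, 1, 1); comparing it with a (7, 7)
# broadcasts to (4, 7, 7) with axes (value, row, column).
# Summing over the row axis leaves one count per (value, column) pair.
counts = (a == values[:, None, None]).sum(axis=1)

print("column:          ", *range(a.shape[1]))
for value, row in zip(values, counts):
    print(f"occurrences of {value}:", *row)
```

Each row of counts belongs to one value and each column to one matrix column, which is the layout asked for in the question.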
I'm having trouble accessing multiple values in a dictionary. Let's say I have this dictionary:
{'1': 0, '2': 1, '3': 2, '4': 3, '5': 4, '6': 5}
I want to find two keys that sum to 6 and display their values. Here, the keys 4 and 2 add to 6, so the 2 values are 3 and 1.
Where do I start? This is the code I have so far:
for key in dico:
if sum(key + key) == 6:
print(f"Numbers # {key:dico} have a sum of 6")
No need for extra loops (or itertools), they will only slow your program down. You already know what the other index needs to be (because you can subtract the index from 6), so just check if that index exists:
dct = {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4, '6': 5}
for i, key in enumerate(dct):
    if i + 2 > len(dct)/2:
        break
    matchIndex = str(6 - int(key))
    if dct.get(matchIndex) is not None:
        print(f'Keys {key} and {matchIndex} have values {dct[key]} and {dct[matchIndex]}')
This approach runs in O(n) time (it stops after about half of the keys), while the other answer has O(n^2) time complexity.
When I tested this approach with timeit, it took 1.72 seconds to run this answer one million times, but the itertools answer took 5.83 seconds.
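The early break above relies on the keys being stored in ascending order. As a hedged sketch, the same complement lookup can be written without that assumption by visiting every key and keeping a pair only when the key is smaller than its partner. The function name pairs_summing_to and the target parameter are illustrative and not part of the original answer.

```python
def pairs_summing_to(dct, target):
    """Return (key, partner) pairs whose keys, read as integers, add up to target."""
    pairs = []
    for key in dct:
        partner = str(target - int(key))
        # int(key) < int(partner) reports each pair once and
        # skips a key that would be paired with itself.
        if partner in dct and int(key) < int(partner):
            pairs.append((key, partner))
    return pairs


dct = {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4, '6': 5}
pairs = pairs_summing_to(dct, 6)
for key, partner in pairs:
    print(f'Keys {key} and {partner} have values {dct[key]} and {dct[partner]}')
```

This is still a single pass with one dictionary lookup per key, so it stays linear in the number of keys.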
You will need to compare each of the dictionary keys with the rest of the other keys. You can use itertools for this.
As you mention you would like to print the value of each of the keys you have in your dictionary, it would be something like this:
import itertools
dico = {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4, '6': 5}
for a, b in itertools.combinations(dico.keys(), 2):
    if int(a) + int(b) == 6:
        print(f"{dico[a]} - {dico[b]}")
You need two loops for that.
Also, keep in mind that there is more than one answer to that problem.
a = {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4, '6': 5}
results = list()
for key_1 in a.keys():
    for key_2 in a.keys():
        # Compare the keys (as integers), not the values; "<" keeps each pair only once
        if int(key_1) < int(key_2) and int(key_1) + int(key_2) == 6:
            results.append((key_1, key_2))
print(results)
Consider the following code snippet:
foo = {'a': 0, 'b': 1, 'c': 2}
for k1 in foo:
    for k2 in foo:
        print(foo[k1], foo[k2])
The output will be
0 0
0 1
0 2
1 0
1 1
1 2
2 0
2 1
2 2
I do not care for the order of the key couples, so I would like a code that outputs
0 0
0 1
0 2
1 1
1 2
2 2
I tried with
foo = {'a': 0, 'b': 1, 'c': 2}
foo_2 = foo.copy()
for k1 in foo_2:
    for k2 in foo_2:
        print(foo[k1], foo[k2])
    foo_2.pop(k1)
but I clearly got
RuntimeError: dictionary changed size during iteration
Other solutions?
foo = {'a': 0, 'b': 1, 'c': 2}
foo_2 = foo.copy()
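for k1 in foo:
    for k2 in foo_2:
        print(foo[k1], foo[k2])
    foo_2.pop(k1)
In your attempt, both loops iterate over foo_2, so popping k1 from foo_2 changes the dictionary while the outer loop is still iterating over it, which raises the error. Iterating over foo in the outer loop and over the copy foo_2 in the inner loop avoids this, because foo_2 is only modified after its inner loop has finished.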
>>> foo = {'a': 0, 'b': 1, 'c': 2}
>>> keys = list(foo.keys())
>>> for i, v in enumerate(keys):
...     for j, v2 in enumerate(keys[i:]):
...         print(foo[v], foo[v2])
...
0 0
0 1
0 2
1 1
1 2
2 2
A basic approach.
foo = {'a': 0, 'b': 1, 'c': 2}
for v1 in foo.values():
    for v2 in foo.values():
        if v1 <= v2:
            print(v1, v2)
It could be done with itertools.combinations_with_replacement as well:
from itertools import combinations_with_replacement
foo = {'a': 0, 'b': 1, 'c': 2}
print(*[f'{foo[k1]} {foo[k2]}' for k1, k2 in combinations_with_replacement(foo.keys(), r=2)], sep='\n')
You can just pass the values of foo dictionary to a list and loop.
foo = {'a': 0, 'b': 1, 'c': 2}
val_list = list(foo.values())
for k1 in foo.values():
    for row in val_list:
        print(k1, row)
    val_list.pop(0)
The data looks like this:
import pandas as pd
d = {'location_id': [1, 2, 3, 4, 5], 'x': [47.43715, 48.213889, 46.631111, 46.551111, 47.356628], 'y': [11.880689, 14.274444, 14.371, 13.665556, 11.705181]}
df = pd.DataFrame(data=d)
print(df)
   location_id          x          y
0            1   47.43715  11.880689
1            2  48.213889  14.274444
2            3  46.631111     14.371
3            4  46.551111  13.665556
4            5  47.356628  11.705181
Expected output:
{(47.43715, 11.880689): 1, (48.213889, 14.274444): 2, (46.631111, 14.371): 3, ...}
That way I can look up the ID simply by providing the point coordinates.
What I have tried:
dict(zip(df['x'].astype('float'), df['y'].astype('float'), zip(df['location_id'])))
Error: ValueError: dictionary update sequence element #0 has length 3; 2 is required
or
dict(zip(tuple(df['x'].astype('float'), df['y'].astype('float')), zip(df['location_id'])))
TypeError: tuple expected at most 1 arguments, got 2
I have Googled for it a while, but I am not very clear about it. Thank you for any assistance.
I think this
result = dict(zip(zip(df['x'], df['y']), df['location_id']))
should give you what you want? Result:
{(47.43715, 11.880689): 1,
(48.213889, 14.274444): 2,
(46.631111, 14.371): 3,
(46.551111, 13.665556): 4,
(47.356628, 11.705181): 5}
I didn't use a dataframe, is this what you wanted?
my_dict = {}
d = {'location_id': [1, 2, 3, 4, 5], 'x': [47.43715, 48.213889, 46.631111, 46.551111, 47.356628], 'y': [11.880689, 14.274444, 14.371, 13.665556, 11.705181]}
for i in range(len(d['location_id'])):
    my_dict[(d['x'][i], d['y'][i])] = d['location_id'][i]
You can set the x and y columns as the index, then export the location_id column to a dictionary:
d = df.set_index(['x', 'y'])['location_id'].to_dict()
print(d)
{(47.43715, 11.880689): 1, (48.213889, 14.274444): 2, (46.631111, 14.371): 3, (46.551111, 13.665556): 4, (47.356628, 11.705181): 5}
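As a small usage sketch, this is how a lookup against such a dictionary behaves. It uses plain lists instead of a DataFrame (an assumption made to keep the example self-contained; DataFrame columns zip the same way), and the names xs, ys, ids and lookup are illustrative.

```python
xs = [47.43715, 48.213889, 46.631111]
ys = [11.880689, 14.274444, 14.371]
ids = [1, 2, 3]

# The inner zip pairs each x with its y; the outer zip pairs each
# (x, y) tuple with its ID, and dict() turns those pairs into a mapping.
lookup = dict(zip(zip(xs, ys), ids))

print(lookup[(47.43715, 11.880689)])  # ID of a known point
print(lookup.get((0.0, 0.0)))         # None for coordinates that are not present
```

Note that float keys must match exactly, so coordinates that were rounded or recomputed elsewhere may not be found.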
I'm grateful for your help with what feels like a stupid question. I've pulled a SQLite table into a pandas dataframe so I can tokenize and count the frequency of words from a series of tweets.
With the code below, I can produce this for the first tweet. How do I iterate for the whole table?
conn = sqlite3.connect("tweets.sqlite")
data = pd.read_sql_query("select tweet_text from tweets_new;", conn)
tokenizer=RegexpTokenizer(r'\w+')
tokens=tokenizer.tokenize(data['tweet_text'][0])
words = nltk.FreqDist(tokens)
unigram_df = pd.DataFrame(words.most_common(),
columns=["WORD","COUNT"])
unigram_df
When I change the value to anything other than a single row, I get the following error:
TypeError: expected string or buffer
I know there are other ways of doing this, but I need to do it along these lines because of how I intend to use the output next. Thanks for any help you can provide!
I have tried:
%%time
tokenizer = RegexpTokenizer(r'\w+')
print "Cleaning the tweets...\n"
for i in xrange(0,len(df)):
if( (i+1)%1000000 == 0 ):
tokens=tokenizer.tokenize(df['tweet_text'][i])
words = nltk.FreqDist(tokens)
This looks like it should work, but still only returns words from the first row.
I think your problem can be solved more concisely using CountVectorizer. I'll give you an example. Given the following inputs:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
corpus_tweets = [['I love pizza and hambuerger'],['I love apple and chips'], ['The pen is on the table!!']]
df = pd.DataFrame(corpus_tweets, columns=['tweet_text'])
You can create your bag of words template with these few lines:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df.tweet_text)
You can print the obtained vocabulary:
count_vect.vocabulary_
# output: {'love': 5, 'pizza': 8, 'and': 0, 'hambuerger': 3, 'apple': 1, 'chips': 2, 'the': 10, 'pen': 7, 'is': 4, 'on': 6, 'table': 9}
and get the dataframe with word counts:
df_count = pd.DataFrame(X_train_counts.todense(), columns=count_vect.get_feature_names_out())
   and  apple  chips  hambuerger  is  love  on  pen  pizza  table  the
0    1      0      0           1   0     1   0    0      1      0    0
1    1      1      1           0   0     1   0    0      0      0    0
2    0      0      0           0   1     0   1    1      0      1    2
If it is useful for you, you can merge the dataframe of the counts with the dataframe of the corpus:
pd.concat([df, df_count], axis=1)
                    tweet_text  and  apple  chips  hambuerger  is  love  on  \
0  I love pizza and hambuerger    1      0      0           1   0     1   0
1       I love apple and chips    1      1      1           0   0     1   0
2    The pen is on the table!!    0      0      0           0   1     0   1

   pen  pizza  table  the
0    0      1      0    0
1    0      0      0    0
2    1      0      1    2
If you want to get the dictionary containing the <word, count> pairs for each document, at this point all you need to do is:
dict_count = df_count.T.to_dict()
{0: {'and': 1,
'apple': 0,
'chips': 0,
'hambuerger': 1,
'is': 0,
'love': 1,
'on': 0,
'pen': 0,
'pizza': 1,
'table': 0,
'the': 0},
1: {'and': 1,
'apple': 1,
'chips': 1,
'hambuerger': 0,
'is': 0,
'love': 1,
'on': 0,
'pen': 0,
'pizza': 0,
'table': 0,
'the': 0},
2: {'and': 0,
'apple': 0,
'chips': 0,
'hambuerger': 0,
'is': 1,
'love': 0,
'on': 1,
'pen': 1,
'pizza': 0,
'table': 1,
'the': 2}}
Note: turning X_train_counts, which is a SciPy sparse matrix, into a dense dataframe is not a good idea for a large corpus, because every zero count is then stored explicitly. But it can be useful to understand and visualize the various steps of your model.
After creating the DataFrame, loop over all the rows:
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
fdist = FreqDist()
for txt in data['tweet_text']:
    for word in tokenizer.tokenize(txt):
        fdist[word.lower()] += 1
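The same accumulate-over-all-rows idea can be sketched with the standard library only. This is not the NLTK code above: collections.Counter stands in for FreqDist and re.findall for RegexpTokenizer, and the sample tweets are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical sample data standing in for data['tweet_text']
tweets = [
    "I love pizza and pizza loves me",
    "The pen is on the table!!",
]

word_counts = Counter()
for text in tweets:
    # \w+ extracts runs of word characters, like RegexpTokenizer(r'\w+')
    word_counts.update(word.lower() for word in re.findall(r"\w+", text))

# A list of (word, count) tuples, most frequent first
rows = word_counts.most_common()
print(rows[:3])
```

The rows list has the same (word, count) shape as words.most_common() in the question, so it could be passed to pd.DataFrame(rows, columns=["WORD", "COUNT"]) in the same way.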
In case anyone is interested in this niche use case, here's the code I was eventually able to make work:
conn = sqlite3.connect("tweets.sqlite")
data = pd.read_sql_query("select tweet_text from tweets_new;", conn)
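# Join the text of every tweet; str(data) would tokenize the printed (possibly truncated) table instead
alldata = " ".join(data['tweet_text'].astype(str))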
tokenizer=RegexpTokenizer(r'\w+')
tokens=tokenizer.tokenize(alldata)
words = nltk.FreqDist(tokens)
unigram_df = pd.DataFrame(words.most_common(),
columns=["WORD","COUNT"])
Thanks for your help everyone!
I have a DataFrame df with, say, 100 rows and 10 columns. I would like to compute a value for each element depending on whether it is greater than the element in the same column of the previous row.
df = pd.DataFrame.from_dict({
'a': {0: 1, 1: 3, 2: 2, 3: 4},
'b': {0: 1, 1: 4, 2: 2, 3: 6},
'c': {0: 0, 1: 1, 2: 2, 3: 4},
'd': {0: 0, 1: 0, 2: 1, 3: 6},
})
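Starting from the 2nd row (index = 1), I have to compare each row with the row before it, for each column. For example, for the 3rd and 2nd rows:
if the 3rd row is greater than the 2nd row, return their difference
if the 3rd row is smaller than the 2nd row, return their sum
if the 3rd row is equal to the 2nd row, return 0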
for example, I would like to get
df_outcome = pd.DataFrame.from_dict({
'a': {0: 1, 1: 2, 2: 5, 3: 2},
'b': {0: 1, 1: 3, 2: 6, 3: 4},
'c': {0: 0, 1: 1, 2: 1, 3: 2},
'd': {0: 0, 1: 0, 2: 0, 3: 5},
})
because
df.iloc[2,0] < df.iloc[1,0]
df_outcome.iloc[2,0] = df.iloc[2,0] + df.iloc[1,0] = 2 + 3 = 5
also because
df.iloc[2,2] > df.iloc[1,2]
df_outcome.iloc[2,2] = df.iloc[2,2] - df.iloc[1,2] = 2 - 1 = 1
Yes, I currently achieve this in an awkward way. I wonder whether .applymap can be used in this case. If so, how could I write a function that has access to the element in the previous row of the same column?
The complicated original code is as follows.
weightMatrix = pd.DataFrame(np.random.random((100,10)))
def func(weightMatrix):
    cfList = pd.DataFrame(weightMatrix,columns = weightMatrix.columns)
    for col in range(len(weightMatrix.columns)):
        for row in range(len(weightMatrix)):
            if row == 0:
                cfList.iloc[row,col] = weightMatrix.iloc[row,col]
                continue
            if (weightMatrix.iloc[row,col] * weightMatrix.iloc[row-1,col] > 0 ) & (weightMatrix.iloc[row,col]>0 ):
                cfList.iloc[row,col] = np.max(weightMatrix.iloc[row,col] - weightMatrix.iloc[row-1,col],0)
            elif (weightMatrix.iloc[row,col] * weightMatrix.iloc[row-1,col] > 0 ) & (weightMatrix.iloc[row,col]<0 ):
                cfList.iloc[row,col] = np.max(weightMatrix.iloc[row-1,col] - weightMatrix.iloc[row,col],0)
            elif (weightMatrix.iloc[row,col] * weightMatrix.iloc[row-1,col] == 0 ) & (weightMatrix.iloc[row-1,col] == 0 ):
                cfList.iloc[row,col] = np.abs(weightMatrix.iloc[row,col])
            elif (weightMatrix.iloc[row,col] * weightMatrix.iloc[row-1,col] == 0 ) & (weightMatrix.iloc[row-1,col] != 0 ):
                cfList.iloc[row,col] = 0
            else:
                cfList.iloc[row,col] = np.abs(weightMatrix.iloc[row,col])
    return cfList
If you want to compare a row with the previous or next row, df.shift() is your friend.
If you want to do things depending on certain conditions, df.where() can help you.
df1 = df.shift(1,).iloc[1:].astype(int)
df2 = df.iloc[1:]
result = df2.copy()
result = df2.where(df1!=df2, 0)
result = (df2 - df1).where(df1 < df2, result)
result = (df1 + df2).where(df1 > df2, result)
result = pd.concat((df.iloc[[0]], result), axis=0)
   a  b  c  d
0  1  1  0  0
1  2  3  1  0
2  5  6  1  1
3  2  4  2  5
PS: I think your df_outcome is wrong for element d2. If not, I missed something in your explanation.
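The three rules (difference, sum, zero) can also be sketched on a plain NumPy array with np.select. This is an alternative to the shift/where chain above, not a rewrite of it; the array values holds the question's example data, and the names prev, cur, body and result are illustrative.

```python
import numpy as np

# Columns a, b, c, d of the example DataFrame, one row per index
values = np.array([[1, 1, 0, 0],
                   [3, 4, 1, 0],
                   [2, 2, 2, 1],
                   [4, 6, 4, 6]])

prev = values[:-1]  # every row except the last: the "previous row"
cur = values[1:]    # every row except the first: the "current row"

# The first matching condition wins; equal elements fall through to default=0
body = np.select([cur > prev, cur < prev],
                 [cur - prev, cur + prev],
                 default=0)

# The first row has no previous row, so it is kept unchanged
result = np.vstack([values[:1], body])
print(result)
```

If a DataFrame is needed afterwards, the array can be wrapped again with pd.DataFrame(result, columns=df.columns, index=df.index).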