I have a dataframe, with Count arranged in descending order, that looks something like this:
df = pd.DataFrame({'Topic': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'],
'Count': [80, 75, 70, 65, 60, 55, 50, 45, 40, 35, 30, 25, 20]})
But with more than 50 rows.
I would like to create a pie chart of the top 10 topics, with the remaining topics summed up and shown as a single slice labelled "Others" with its percentage. Is it possible to omit the labels on the individual slices and list them separately in a legend?
Thanking in anticipation
Use Series.where to replace Topic with 'Other' for every row outside the top N, aggregate the counts with groupby and sum, then plot with Series.plot.pie:
N = 10
df['Topic'] = df['Topic'].where(df['Count'].isin(df['Count'].nlargest(N)), 'Other')
s = df.groupby('Topic')['Count'].sum()
pie = s.plot.pie(legend=False)  # plot the aggregated Series, not the raw df
#https://stackoverflow.com/a/44076433/2901002
labels = [f'{l}, {p:0.1f}%' for l, p in zip(s.index, s / s.sum() * 100)]
plt.legend(bbox_to_anchor=(0.85, 1), loc='upper left', labels=labels)
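The grouping step can also be checked without drawing anything. The following is a minimal sketch, assuming the sample data from the question; `top_n_with_other` is a hypothetical helper name, not part of pandas. It keeps the N largest counts, sums the rest into "Other", and builds the legend labels as percentages.

```python
import pandas as pd

def top_n_with_other(df, n, key="Topic", value="Count", other="Other"):
    """Return a Series of the n largest values plus one summed `other` entry."""
    ranked = df.sort_values(value, ascending=False)
    top = ranked.iloc[:n].set_index(key)[value]
    if len(ranked) > n:
        # Sum everything below the top n into a single extra entry.
        rest = ranked.iloc[n:][value].sum()
        top = pd.concat([top, pd.Series({other: rest}, name=value)])
    return top

df = pd.DataFrame({"Topic": list("ABCDEFGHIJKLM"),
                   "Count": [80, 75, 70, 65, 60, 55, 50, 45, 40, 35, 30, 25, 20]})

s = top_n_with_other(df, 10)
# Percentages must be scaled by 100 before formatting.
labels = [f"{topic}, {share:0.1f}%" for topic, share in zip(s.index, s / s.sum() * 100)]
print(labels)
```

The resulting `s` can then be passed to `Series.plot.pie` and `labels` to `plt.legend(labels=labels)`, as in the answer above.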
You need to build a new dataframe. Assuming your counts are sorted in descending order (if not, use df.sort_values(by='Count', ascending=False, inplace=True)):
TOP = 10
df2 = df.iloc[:TOP]
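# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
other = pd.DataFrame({'Topic': ['Other'], 'Count': [df['Count'].iloc[TOP:].sum()]})
df2 = pd.concat([df2, other], ignore_index=True)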
df2.set_index('Topic').plot.pie(y='Count', legend=False)
Example (N=10, N=5):
Percentages in the legend:
N = 5
df2 = df.iloc[:N]
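other = pd.DataFrame({'Topic': ['Other'], 'Count': [df['Count'].iloc[N:].sum()]})
df2 = pd.concat([df2, other], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
df2.set_index('Topic').plot.pie(y='Count', legend=False)
pct = df2['Count'] / df2['Count'].sum() * 100  # share of each slice in percent
leg = plt.legend(labels=[f'{t}: {p:.1f}%' for t, p in zip(df2['Topic'], pct)])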
output:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Topic': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'],
'Count': [80, 75, 70, 65, 60, 55, 50, 45, 40, 35, 30, 25, 20]})
df.index = df.Topic
plot = df.plot.pie(y='Count', figsize=(5, 5))
plt.show()
See the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.pie.html
I have a dataset that I need to group by the column group:
import numpy as np
import pandas as pd
arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'group': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
"target": arr})
for g_name, g_df in df.groupby("group"):
    print("GROUP: {}".format(g_name))
    print(g_df)
However, sometimes group might not exist as a column, and in that case I am trying to treat the whole data as a single group.
for g_name, g_df in df.groupby(SOMEPARAMETERS):
    print(g_df)
output:
target
1
2
4
7
11
16
22
29
37
46
Is it possible to change the parameter of groupby to get whole data as a single group?
Assuming you mean something like this where you have two columns on which you want to group:
import numpy as np
import pandas as pd
arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'group1': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'group2': ['C', 'D', 'D', 'C', 'D', 'D', 'C', 'D', 'D', 'C'],
'target': arr})
Then you can easily extend your example with:
for g_name, g_df in df.groupby(["group1", "group2"]):
    print("GROUP: {}".format(g_name))
    print(g_df)
Is this what you meant?
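If the goal is instead to fall back to one group when the column is missing, one possibility is to pass groupby a callable that returns the same key for every row. This is a minimal sketch; `iter_groups` and the key name "all" are hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd

def iter_groups(df, column="group"):
    """Yield (name, sub-frame) pairs, or a single pair covering the
    whole frame when `column` is missing."""
    if column in df.columns:
        yield from df.groupby(column)
    else:
        # A callable key is applied to each index label; returning a
        # constant puts every row in the same group.
        yield from df.groupby(lambda _: "all")

arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({"target": arr})  # no "group" column here

for g_name, g_df in iter_groups(df):
    print("GROUP: {}".format(g_name))
    print(g_df)
```

The loop body stays the same in both cases, so the calling code does not need to know whether the column exists.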
I would like to draw a box plot for the data set below, but I don't need the whiskers. I need a rectangle whose edges are the max and min. It does not have to be a rectangle; a thick line would also do.
Please help.
Thank you.
import seaborn as sns
import pandas as pd
df=pd.DataFrame({'grup':['a','a','a','a','b','b','b','c','c','c','c','c','c','c'],'X1':
[10,9,12,5,20,43,28,40,65,78,65,98,100,150]})
df
ax = sns.boxplot(x="grup", y="X1", data=df, palette="Set3")
You can create a barplot, using the minimums as 'bottom' and the difference between maximums and minimums as heights.
Note that a barplot has a "sticky" bottom, fixing the lowest point of the y-axis to the lowest bar. As a remedy, we can change the ylim.
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.DataFrame({'grup': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'c', 'c'],
'X1': [10, 9, 12, 5, 20, 43, 28, 40, 65, 78, 65, 98, 100, 150]})
grups = np.unique(df['grup'])
bottoms = [df[df['grup'] == g]['X1'].min() for g in grups]
heights = [df[df['grup'] == g]['X1'].max() - g_bottom for g, g_bottom in zip(grups, bottoms)]
ax = sns.barplot(x=grups, y=heights, bottom=bottoms, palette="Set3", ec='black')
# for reference, show where the values are; leave this line out for the final plot
sns.stripplot(x='grup', y='X1', color='red', s=10, data=df, ax=ax)
ax.set_xlabel('grup') # needed because the barplot isn't directly using a dataframe
ax.set_ylabel('X1')
ax.set_ylim(ymin=0)
Update: adding the minimum and maximum values:
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.DataFrame({'grup': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'c', 'c'],
'X1': [10, 9, 12, 5, 20, 43, 28, 40, 65, 78, 65, 98, 100, 150]})
grups = np.unique(df['grup'])
bottoms = np.array([df[df['grup'] == g]['X1'].min() for g in grups])
tops = np.array([df[df['grup'] == g]['X1'].max() for g in grups])
ax = sns.barplot(x=grups, y=tops - bottoms, bottom=bottoms, palette="Set3", ec='black')
ax.set_xlabel('grup') # needed because the barplot isn't directly using a dataframe
ax.set_ylabel('X1')
ax.set_ylim(ymin=0)
for i, (grup, bottom, top) in enumerate(zip(grups, bottoms, tops)):
    ax.text(i, bottom, f'\n{bottom}', ha='center', va='center')
    ax.text(i, top, f'{top}\n', ha='center', va='center')
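The per-group minimum and maximum can also be computed in one step with groupby and agg instead of list comprehensions. This is a sketch on the same sample data; the column name `height` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"grup": list("aaaabbbccccccc"),
                   "X1": [10, 9, 12, 5, 20, 43, 28, 40, 65, 78, 65, 98, 100, 150]})

# One row per group, with the minimum and maximum of X1.
stats = df.groupby("grup")["X1"].agg(["min", "max"])
# Bar height is the distance from the minimum up to the maximum.
stats["height"] = stats["max"] - stats["min"]
print(stats)
```

Here `stats.index`, `stats["min"]` and `stats["height"]` correspond to `grups`, `bottoms` and the bar heights in the code above.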
Given the following DataFrame (in pandas):
X Y Type Region
index
1 100 50 A US
2 50 25 A UK
3 70 35 B US
4 60 40 B UK
5 80 120 C US
6 120 35 C UK
In order to generate the DataFrame:
import pandas as pd
data = pd.DataFrame({'X': [100, 50, 70, 60, 80, 120],
'Y': [50, 25, 35, 40, 120, 35],
'Type': ['A', 'A', 'B', 'B', 'C', 'C'],
'Region': ['US', 'UK'] * 3
},
columns=['X', 'Y', 'Type', 'Region']
)
I tried to make a scatter plot of X and Y, colored by Type and shaped by Region. How could I achieve it in matplotlib?
With more Pandas:
from pandas import DataFrame
from matplotlib.pyplot import show, subplots
from itertools import cycle # Useful when you might have lots of Regions
data = DataFrame({'X': [100, 50, 70, 60, 80, 120],
'Y': [50, 25, 35, 40, 120, 35],
'Type': ['A', 'A', 'B', 'B', 'C', 'C'],
'Region': ['US', 'UK'] * 3
},
columns=['X', 'Y', 'Type', 'Region']
)
cs = {'A':'red',
'B':'blue',
'C':'green'}
markers = ('+','o','>')
fig, ax = subplots()
for region, marker in zip(set(data.Region),cycle(markers)):
    reg_data = data[data.Region==region]
    reg_data.plot(x='X', y='Y',
                  kind='scatter',
                  ax=ax,
                  c=[cs[x] for x in reg_data.Type],
                  marker=marker,
                  label=region)
ax.legend()
show()
For this kind of multi-dimensional plot, though, check out seaborn (works well with pandas).
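A variation that stays within pandas and matplotlib is to group by both columns and draw one scatter call per (Type, Region) pair, so that every combination gets its own legend entry. This is a minimal sketch; the function name `scatter_by_type_and_region` and the marker choices are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({"X": [100, 50, 70, 60, 80, 120],
                     "Y": [50, 25, 35, 40, 120, 35],
                     "Type": ["A", "A", "B", "B", "C", "C"],
                     "Region": ["US", "UK"] * 3})

colors = {"A": "red", "B": "blue", "C": "green"}  # one color per Type
markers = {"US": "o", "UK": "^"}                   # one marker per Region (assumed choice)

def scatter_by_type_and_region(data, ax=None):
    """Draw one scatter series per (Type, Region) combination."""
    if ax is None:
        _, ax = plt.subplots()
    for (type_, region), sub in data.groupby(["Type", "Region"]):
        ax.scatter(sub["X"], sub["Y"],
                   color=colors[type_], marker=markers[region],
                   label=f"{type_} / {region}")
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
    ax.legend()
    return ax

ax = scatter_by_type_and_region(data)
```

Call `plt.show()` afterwards to display the figure in an interactive session.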
One approach is the following. It is not elegant, but it works:
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
plt.ion()
colors = ['g', 'r', 'c', 'm', 'y', 'k', 'b']
markers = ['*','+','D','H']
for iType in range(len(data.Type.unique())):
    for iRegion in range(len(data.Region.unique())):
        plt.plot(data.X.values[np.bitwise_and(data.Type.values == data.Type.unique()[iType],
                                              data.Region.values == data.Region.unique()[iRegion])],
                 data.Y.values[np.bitwise_and(data.Type.values == data.Type.unique()[iType],
                                              data.Region.values == data.Region.unique()[iRegion])],
                 color=colors[iType],marker=markers[iRegion],ms=10)
I am not familiar with pandas, but there must be a more elegant way to do the filtering. A list of markers can be obtained from matplotlib using markers.MarkerStyle.markers.keys(), and the default color cycle using plt.rcParams['axes.prop_cycle'].by_key()['color'] (older matplotlib releases exposed it as gca()._get_lines.color_cycle.next()).
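For recent matplotlib versions, the two lookups mentioned above can be sketched as follows. The variable names `marker_names` and `colors` are chosen here for illustration, and the sketch assumes a style whose property cycle defines a `color` key, as the default style does.

```python
import matplotlib.pyplot as plt
from matplotlib.markers import MarkerStyle

# All marker codes known to matplotlib, e.g. 'o', '+', 'D'.
marker_names = list(MarkerStyle.markers.keys())

# Colors of the active property cycle, in the order they are used.
colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]

print(marker_names[:5])
print(colors[:3])
```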
I am trying to write a program that checks if smaller words are found within a larger word. For example, the word "computer" contains the words "put", "rum", "cut", etc. To perform the check I am trying to code each word as a product of prime numbers, that way the smaller words will all be factors of the larger word. I have a list of letters and a list of primes and have assigned (I think) an integer value to each letter:
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59,
61, 67, 71, 73, 79, 83, 89, 97, 101]
index = 0
while index <= len(letters)-1:
    letters[index] = primes[index]
    index += 1
The problem I am having now is how to get the integer code for a given word and be able to create the codes for a whole list of words. For example, I want to be able to input the word "cab," and have the code generate its integer value of 5*2*3 = 30.
Any help would be much appreciated.
from functools import reduce # only needed for Python 3.x
from operator import mul
primes = [
2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
]
lookup = dict(zip("abcdefghijklmnopqrstuvwxyz", primes))
def encode(s):
    # the initial value 1 keeps reduce from failing on an empty string
    return reduce(mul, (lookup.get(ch, 1) for ch in s.lower()), 1)
then
encode("cat") # => 710
encode("act") # => 710
Edit: more to the point,
def is_anagram(s1, s2):
    """
    s1 consists of the same letters as s2, rearranged
    """
    return encode(s1) == encode(s2)
def is_subset(s1, s2):
    """
    s1 consists of some letters from s2, rearranged
    """
    return encode(s2) % encode(s1) == 0
then
is_anagram("cat", "act") # => True
is_subset("cat", "tactful") # => True
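Applied to the original question, the divisibility test can filter a word list for the words hidden in a larger word. This is a minimal sketch using `math.prod` (Python 3.8+) in place of `reduce`; `find_subwords` and the sample word list are hypothetical.

```python
from math import prod

primes = [
    2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
    43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
]
lookup = dict(zip("abcdefghijklmnopqrstuvwxyz", primes))

def encode(s):
    """Product of the primes of the letters in s; other characters count as 1."""
    return prod(lookup.get(ch, 1) for ch in s.lower())

def find_subwords(big, candidates):
    """Return the candidates whose letters all occur in `big`,
    counting repeated letters."""
    code = encode(big)
    return [w for w in candidates if code % encode(w) == 0]

words = ["put", "rum", "cut", "cat", "tree"]
print(find_subwords("computer", words))
```

Here "cat" is rejected because "computer" has no "a", and "tree" is rejected because it needs two "e"s while "computer" has only one.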
I would use a dict here to look-up the prime for a given letter:
In [1]: letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
In [2]: primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59,
61, 67, 71, 73, 79, 83, 89, 97, 101]
In [3]: lookup = dict(zip(letters, primes))
In [4]: lookup['a']
Out[4]: 2
This will let you easily determine the list of primes for a given word:
In [5]: [lookup[letter] for letter in "computer"]
Out[5]: [5, 47, 41, 53, 73, 71, 11, 61]
To find the product of those primes:
In [6]: import operator; from functools import reduce  # reduce is a builtin only in Python 2
In [7]: reduce(operator.mul, [lookup[letter] for letter in "cab"])
Out[7]: 30
You've got your two lists set up, so now you just need to iterate over each character in a word and determine what value that letter gives you.
Something like
total = 1
for letter in word:
    index = letters.index(letter)
    total *= primes[index]
Or whichever operation you decide to use.
You would generalize that to a list of words.
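One way to generalize to a list of words is to wrap the loop in a function and build a dictionary of codes. This is a sketch; `word_code`, `words` and `codes` are names chosen here for illustration.

```python
letters = list("abcdefghijklmnopqrstuvwxyz")
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59,
          61, 67, 71, 73, 79, 83, 89, 97, 101]

def word_code(word):
    """Multiply together the primes assigned to each letter of `word`."""
    total = 1
    for letter in word:
        # letters.index raises ValueError for anything but a-z.
        index = letters.index(letter)
        total *= primes[index]
    return total

words = ["cab", "cat", "computer"]
codes = {word: word_code(word) for word in words}
print(codes["cab"], codes["cat"])
```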
Hmmmm... It isn't very clear how this code is supposed to run. If it is meant to find words in the English dictionary, consider using PyEnchant, a module for checking whether words are in the dictionary. Something you could try is this:
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
word = input('What is your word? ')  # raw_input in Python 2
word = list(word)
total = 1
nums = []
for k in word:
    nums.append(primes[letters.index(k)])
for k in nums:
    total = total*k
print(total)
This will output as:
>>> What is your word? cat
710
>>>
This is correct, as 5*2*71 equals 710.