How to check the frequency of every unique value in a pandas DataFrame? - python

I have a DataFrame of about 2000 rows in which, let's say, brand_name has 142 unique values, and I want to count the frequency of every unique value, from 1 to 142. The values should change dynamically.
brand=clothes_z.brand_name
brand.describe(include="all")
unique_brand=brand.unique()
brand.describe(include="all"),unique_brand
Output:
(count 2613
unique 142
top Mango
freq 54
Name: brand_name, dtype: object,
array(['Jack & Jones', 'TOM TAILOR DENIM', 'YOURTURN', 'Tommy Jeans',
'Alessandro Zavetti', 'adidas Originals', 'Volcom', 'Pier One',
'Superdry', 'G-Star', 'SIKSILK', 'Tommy Hilfiger', 'Karl Kani',
'Alpha Industries', 'Farah', 'Nike Sportswear',
'Calvin Klein Jeans', 'Champion', 'Hollister Co.', 'PULL&BEAR',
'Nike Performance', 'Even&Odd', 'Stradivarius', 'Mango',
'Champion Reverse Weave', 'Massimo Dutti', 'Selected Femme Petite',
'NAF NAF', 'YAS', 'New Look', 'Missguided', 'Miss Selfridge',
'Topshop', 'Miss Selfridge Petite', 'Guess', 'Esprit Collection',
'Vero Moda', 'ONLY Petite', 'Selected Femme', 'ONLY', 'Dr.Denim',
'Bershka', 'Vero Moda Petite', 'PULL & BEAR', 'New Look Petite',
'JDY', 'Even & Odd', 'Vila', 'Lacoste', 'PS Paul Smith',
'Redefined Rebel', 'Selected Homme', 'BOSS', 'Brave Soul', 'Mind',
'Scotch & Soda', 'Only & Sons', 'The North Face',
'Polo Ralph Lauren', 'Gym King', 'Selected Woman', 'Rich & Royal',
'Rooms', 'Glamorous', 'Club L London', 'Zalando Essentials',
'edc by Esprit', 'OYSHO', 'Oasis', 'Gina Tricot',
'Glamorous Petite', 'Cortefiel', 'Missguided Petite',
'Missguided Tall', 'River Island', 'INDICODE JEANS',
'Kings Will Dream', 'Topman', 'Esprit', 'Diesel', 'Key Largo',
'Mennace', 'Lee', "Levi's®", 'adidas Performance', 'jordan',
'Jack & Jones PREMIUM', 'They', 'Springfield', 'Benetton', 'Fila',
'Replay', 'Original Penguin', 'Kronstadt', 'Vans', 'Jordan',
'Apart', 'New look', 'River island', 'Freequent', 'Mads Nørgaard',
'4th & Reckless', 'Morgan', 'Honey punch', 'Anna Field Petite',
'Noisy may', 'Pepe Jeans', 'Mavi', 'mint & berry', 'KIOMI', 'mbyM',
'Escada Sport', 'Lost Ink', 'More & More', 'Coffee', 'GANT',
'TWINTIP', 'MAMALICIOUS', 'Noisy May', 'Pieces', 'Rest',
'Anna Field', 'Pinko', 'Forever New', 'ICHI', 'Seafolly', 'Object',
'Freya', 'Wrangler', 'Cream', 'LTB', 'G-star', 'Dorothy Perkins',
'Carhartt WIP', 'Betty & Co', 'GAP', 'ONLY Tall', 'Next', 'HUGO',
'Violet by Mango', 'WEEKEND MaxMara', 'French Connection'],
dtype=object))
It shows only the frequency of "Mango" (54) because that is the top frequency, but I want the frequency of every value: what is the frequency of Jack & Jones, TOM TAILOR DENIM, YOURTURN, and so on. And the values should change dynamically.

You could simply do:
clothes_z.brand_name.value_counts()
This will list the unique values and give you the frequency of every element in that pandas Series.
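For example, to turn the counts into a two-column frequency table (a sketch, assuming clothes_z as in the question; freq_table is a hypothetical name):
counts = clothes_z.brand_name.value_counts()
print(counts.head())  # e.g. Mango 54 first, since counts are sorted by frequency

freq_table = counts.rename_axis('brand_name').reset_index(name='frequency')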

from collections import Counter
import pandas as pd

ll = clothes_z.brand_name.tolist()  # your list of brands
c = Counter(ll)
# you can do whatever you want with your counted values
df = pd.DataFrame.from_dict(c, orient='index', columns=['counted'])
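Counter can also give you the most frequent entries directly, for example:
print(c.most_common(5))  # the five most common brands with their counts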


How to reaggregate a MultiIndex pandas.DataFrame?

I have a MultiIndex pandas.DataFrame dcba with a large level of spatial disaggregation:
region FR \
sector Agriculture Crude coal Crude oil Natural gas
region stressor
FR CO2 4.796711 1.382087e-02 3.149139e-05 2.894532
CH4 15.816831 3.744709e-05 3.567591e-04 0.275431
N2O 9.715682 9.290865e-05 5.603963e-07 0.007834
SF6 0.028011 2.818101e-06 2.607044e-08 0.000477
HFC 1.487352 1.473641e-04 1.475096e-06 0.024675
... ... ... ... ...
RoW Middle East CH4 0.455748 3.566337e-05 7.060048e-04 0.035420
N2O 0.193417 1.176733e-06 7.366779e-07 0.002564
SF6 0.001478 7.465562e-08 2.960808e-08 0.000107
HFC 0.006629 3.190865e-07 1.281020e-07 0.000472
PFC 0.001390 1.491053e-07 3.205249e-08 0.000603
region \
sector Extractive industry Biomass_industry Clothing
region stressor
FR CO2 5.817926e-03 9.866832 0.394570
CH4 3.622520e-04 9.923741 0.075845
N2O 1.267742e-04 6.010542 0.027877
SF6 4.797571e-04 0.036355 0.000561
HFC 2.502868e-02 1.894707 0.028297
... ... ... ...
RoW Middle East CH4 1.844972e-04 0.419346 0.193006
N2O 7.236885e-06 0.062690 0.018240
SF6 9.461463e-07 0.001052 0.000477
HFC 4.114220e-06 0.004652 0.002087
PFC 1.314401e-06 0.002939 0.001726
region ... \
sector Heavy_industry Construction Automobile ...
region stressor ...
FR CO2 13.261457 14.029825 2.479608 ...
CH4 0.632317 2.537475 0.319671 ...
N2O 0.196020 0.968326 0.082451 ...
SF6 0.024654 0.054173 0.003670 ...
HFC 1.641677 2.874809 0.197846 ...
... ... ... ... ...
RoW Middle East CH4 0.677210 0.926126 0.325147 ...
N2O 0.049768 0.034912 0.020158 ...
SF6 0.002112 0.001568 0.000955 ...
HFC 0.009280 0.006824 0.004142 ...
PFC 0.011609 0.006201 0.003916 ...
region RoW Middle East \
sector Heavy_industry Construction Automobile
region stressor
FR CO2 0.580714 0.382980 0.162650
CH4 0.046371 0.114092 0.021962
N2O 0.019406 0.059560 0.007892
SF6 0.001126 0.000872 0.000270
HFC 0.073273 0.049812 0.015326
... ... ... ...
RoW Middle East CH4 2.238149 19.760153 1.079266
N2O 0.222995 2.752258 0.069067
SF6 0.009341 0.162138 0.004313
HFC 0.041137 0.702098 0.018245
PFC 0.057405 0.285898 0.007766
region \
sector Oth transport equipment Machinery Electronics
region stressor
FR CO2 0.116935 0.394273 0.080354
CH4 0.016530 0.048727 0.010756
N2O 0.004032 0.018393 0.004233
SF6 0.000166 0.000665 0.000115
HFC 0.008293 0.036774 0.006075
... ... ... ...
RoW Middle East CH4 0.139413 3.370381 0.650511
N2O 0.009559 0.247341 0.058345
SF6 0.000506 0.013730 0.003265
HFC 0.002176 0.056321 0.013685
PFC 0.001429 0.030418 0.006383
region \
sector Fossil fuels Electricity and heat Transport services
region stressor
FR CO2 0.107540 0.015568 0.058673
CH4 0.018198 0.003783 0.007705
N2O 0.006238 0.001653 0.003543
SF6 0.000204 0.000029 0.000061
HFC 0.010712 0.001534 0.003187
... ... ... ...
RoW Middle East CH4 16.407198 5.020937 2.359744
N2O 0.134513 0.432547 0.510101
SF6 0.009963 0.007036 0.012166
HFC 0.044495 0.031509 0.051611
PFC 0.008458 0.004833 0.006725
region
sector Composite
region stressor
FR CO2 0.801035
CH4 0.311628
N2O 0.150162
SF6 0.001836
HFC 0.094331
... ...
RoW Middle East CH4 119.001176
N2O 8.039872
SF6 0.941479
HFC 3.943134
PFC 0.422255
[294 rows x 833 columns]
The disaggregation is defined by the list of regions:
reg_list = ['FR', 'Austria', 'Belgium', 'Bulgaria', 'Cyprus', 'Czech Republic', 'Germany', 'Denmark', 'Estonia', 'Spain', 'Finland', 'Greece', 'Croatia', 'Hungary', 'Ireland', 'Italy', 'Lithuania', 'Luxembourg', 'Latvia', 'Malta', 'Netherlands', 'Poland', 'Portugal', 'Romania', 'Sweden', 'Slovenia', 'Slovakia', 'United Kingdom', 'United States', 'Japan', 'China', 'Canada', 'South Korea', 'Brazil', 'India', 'Mexico', 'Russia', 'Australia', 'Switzerland', 'Turkey', 'Taiwan', 'Norway', 'Indonesia', 'South Africa', 'RoW Asia and Pacific', 'RoW America', 'RoW Europe', 'RoW Africa', 'RoW Middle East']
sectors_list = ['Agriculture', 'Crude coal', 'Crude oil', 'Natural gas', 'Extractive industry', 'Biomass_industry', 'Clothing', 'Heavy_industry', 'Construction', 'Automobile', 'Oth transport equipment', 'Machinery', 'Electronics', 'Fossil fuels', 'Electricity and heat', 'Transport services', 'Composite']
The DataFrame dcba has the following index and columns:
dcba.index =
MultiIndex([( 'FR', 'CO2'),
( 'FR', 'CH4'),
( 'FR', 'N2O'),
( 'FR', 'SF6'),
( 'FR', 'HFC'),
( 'FR', 'PFC'),
( 'Austria', 'CO2'),
( 'Austria', 'CH4'),
( 'Austria', 'N2O'),
( 'Austria', 'SF6'),
...
( 'RoW Africa', 'N2O'),
( 'RoW Africa', 'SF6'),
( 'RoW Africa', 'HFC'),
( 'RoW Africa', 'PFC'),
('RoW Middle East', 'CO2'),
('RoW Middle East', 'CH4'),
('RoW Middle East', 'N2O'),
('RoW Middle East', 'SF6'),
('RoW Middle East', 'HFC'),
('RoW Middle East', 'PFC')],
names=['region', 'stressor'], length=294)
dcba.columns =
MultiIndex([( 'FR', 'Agriculture'),
( 'FR', 'Crude coal'),
( 'FR', 'Crude oil'),
( 'FR', 'Natural gas'),
( 'FR', 'Extractive industry'),
( 'FR', 'Biomass_industry'),
( 'FR', 'Clothing'),
( 'FR', 'Heavy_industry'),
( 'FR', 'Construction'),
( 'FR', 'Automobile'),
...
('RoW Middle East', 'Heavy_industry'),
('RoW Middle East', 'Construction'),
('RoW Middle East', 'Automobile'),
('RoW Middle East', 'Oth transport equipment'),
('RoW Middle East', 'Machinery'),
('RoW Middle East', 'Electronics'),
('RoW Middle East', 'Fossil fuels'),
('RoW Middle East', 'Electricity and heat'),
('RoW Middle East', 'Transport services'),
('RoW Middle East', 'Composite')],
names=['region', 'sector'], length=833)
And I would like to reaggregate this DataFrame at a different level, grouping the regions differently, as defined here:
dict_reag =
{'United Kingdom': ['United Kingdom'],
'United States': ['United States'],
'Asia and Row Europe': ['Japan',
'India',
'Russia',
'Indonesia',
'RoW Europe'],
'Chinafrica': ['China', 'RoW Africa'],
'Turkey and RoW America': ['Canada', 'Turkey', 'RoW America'],
'Pacific and RoW Middle East': ['South Korea',
'Australia',
'Taiwan',
'RoW Middle East'],
'Brazil, Mexico and South Africa': ['Brazil', 'Mexico', 'South Africa'],
'Switzerland and Norway': ['Switzerland', 'Norway'],
'RoW Asia and Pacific': ['RoW Asia and Pacific'],
'EU': ['Austria',
'Belgium',
'Bulgaria',
'Cyprus',
'Czech Republic',
'Germany',
'Denmark',
'Estonia',
'Spain',
'Finland',
'Greece',
'Croatia',
'Hungary',
'Ireland',
'Italy',
'Lithuania',
'Luxembourg',
'Latvia',
'Malta',
'Netherlands',
'Poland',
'Portugal',
'Romania',
'Sweden',
'Slovenia',
'Slovakia'],
'FR': ['FR']}
The reaggregation would transform this 294x833 DataFrame into a 66x187 DataFrame. Note that each entry of the reaggregated DataFrame corresponds to a sum over the original subregions.
I created an empty DataFrame with the correct new level of aggregation:
ghg_list = ['CO2', 'CH4', 'N2O', 'SF6', 'HFC', 'PFC']

multi_reg = []
multi_sec = []
for reg in list(reag_matrix.columns[2:]):
    for sec in sectors_list:
        multi_reg.append(reg)
        multi_sec.append(sec)
arrays = [multi_reg, multi_sec]
new_col = pd.MultiIndex.from_arrays(arrays, names=('region', 'sector'))

multi_reg2 = []
multi_ghg = []
for reg in list(reag_matrix.columns[2:]):
    for ghg in ghg_list:
        multi_reg2.append(reg)
        multi_ghg.append(ghg)
arrays2 = [multi_reg2, multi_ghg]
new_index = pd.MultiIndex.from_arrays(arrays2, names=('region', 'stressor'))

new_dcba = pd.DataFrame(np.zeros((len(ghg_list) * len(list(reag_matrix.columns[2:])),
                                  len(sectors_list) * len(list(reag_matrix.columns[2:])))),
                        index=new_index, columns=new_col)
where reag_matrix.columns[2:] corresponds to the new list of regions, as defined in dict_reag:
list(reag_matrix.columns[2:]) = ['FR', 'United Kingdom', 'United States', 'Asia and Row Europe', 'Chinafrica', 'Turkey and RoW America', 'Pacific and RoW Middle East', 'Brazil, Mexico and South Africa', 'Switzerland and Norway', 'RoW Asia and Pacific', 'EU']
I guess I could use the groupby function, but I could not make it work without losing the sector disaggregation.
Otherwise, I intended to do it iteratively, but I get errors I do not understand. I first tried to copy the French block, which will stay the same:
s1 = dcba.loc['FR','FR'].copy()
new_dcba.loc['FR','FR'] = s1
But this last line raises the error "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()", although it does not seem to involve any boolean. What is my problem here?
Also, to avoid this, I tried to use:
new_dcba.loc['FR','FR'].update(s1, overwrite=True)
But it does not change the values in new_dcba.
Finally, I tried to use .values, but then a new error is raised:
new_dcba.loc['FR','FR'] = s1.values
"Must have equal len keys and value when setting with an ndarray"
So, I have two questions:
Can you think of a way to use groupby (and .sum()) for this?
What is the issue raising the first error?
Note that I have gone through https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy and could not find any explanation for my specific problem.
I wrote an MCVE here, and it happens to work at a reduced size (the commented values), but not at the actual size of the problem:
list_reg_mcve = reg_list               # ['FR', 'United States', 'United Kingdom']
list_reg_mcve_new = list_reg_reag_new  # ['FR', 'Other']
sectors_list_mcve = sectors_list       # ['Agriculture', 'Composite']
dict_mcve = dict_reag                  # {'FR': ['FR'], 'Other': ['United States', 'United Kingdom']}
ghg_list_mcve = ['CO2', 'CH4', 'N2O', 'SF6', 'HFC', 'PFC']  # ['CO2', 'CH4']

multi_reg = []
multi_sec = []
for reg in list_reg_mcve:
    for sec in sectors_list_mcve:
        multi_reg.append(reg)
        multi_sec.append(sec)
arrays = [multi_reg, multi_sec]
new_col = pd.MultiIndex.from_arrays(arrays, names=('region', 'sector'))

multi_reg2 = []
multi_ghg = []
for reg in list_reg_mcve:
    for ghg in ghg_list_mcve:
        multi_reg2.append(reg)
        multi_ghg.append(ghg)
arrays2 = [multi_reg2, multi_ghg]
new_index = pd.MultiIndex.from_arrays(arrays2, names=('region', 'stressor'))

dcba_mcve = pd.DataFrame(np.zeros((len(ghg_list_mcve) * len(list_reg_mcve),
                                   len(sectors_list_mcve) * len(list_reg_mcve))),
                         index=new_index, columns=new_col)

multi_reg = []
multi_sec = []
for reg in list_reg_mcve_new:
    for sec in sectors_list_mcve:
        multi_reg.append(reg)
        multi_sec.append(sec)
arrays = [multi_reg, multi_sec]
new_col = pd.MultiIndex.from_arrays(arrays, names=('region', 'sector'))

multi_reg2 = []
multi_ghg = []
for reg in list_reg_mcve_new:
    for ghg in ghg_list_mcve:
        multi_reg2.append(reg)
        multi_ghg.append(ghg)
arrays2 = [multi_reg2, multi_ghg]
new_index = pd.MultiIndex.from_arrays(arrays2, names=('region', 'stressor'))

dcba_mcve_new = pd.DataFrame(np.zeros((len(ghg_list_mcve) * len(list_reg_mcve_new),
                                       len(sectors_list_mcve) * len(list_reg_mcve_new))),
                             index=new_index, columns=new_col)

from random import randint
for col in dcba_mcve.columns:
    dcba_mcve[col] = dcba_mcve.apply(lambda x: randint(0, 5), axis=1)
print(dcba_mcve)

for reg_export in dict_mcve:
    list_reg_agg_1 = dict_mcve[reg_export]
    for reg_import in dict_mcve:
        list_reg_agg_2 = dict_mcve[reg_import]
        s1 = pd.DataFrame(np.zeros_like(dcba_mcve_new.loc['FR', 'FR']),
                          index=dcba_mcve_new.loc['FR', 'FR'].index,
                          columns=dcba_mcve_new.loc['FR', 'FR'].columns)
        for reg1 in list_reg_agg_1:
            for reg2 in list_reg_agg_2:
                #print(reg1, reg2)
                s1 += dcba_mcve.loc[reg1, reg2].copy()
        #print(s1)
        dcba_mcve_new.loc[reg_export, reg_import].update(s1)

dcba_mcve_new
Thank you in advance
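
One possible way to use groupby for this (a sketch, assuming dcba and dict_reag as defined above): invert dict_reag into an old-region -> new-region mapping, rename the 'region' level on both axes, and sum the now-duplicated labels.
region_map = {old: new for new, olds in dict_reag.items() for old in olds}

# collapse the row regions: rename the level, then sum duplicated (region, stressor) labels
rows_agg = (dcba.rename(index=region_map, level='region')
                .groupby(level=['region', 'stressor']).sum())

# same on the columns, via a transpose
dcba_new = (rows_agg.rename(columns=region_map, level='region')
                    .T.groupby(level=['region', 'sector']).sum().T)

# dcba_new should now be 66 rows x 187 columns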

How to stop a letter repeating itself in Python

I am writing code that takes a jumbled word and returns the unjumbled word. data.json contains a list of words; I take each word one by one and check whether it contains all the characters of the input, then check whether the lengths are the same. The problem is that when I enter a word such as "helol", the "l" is checked twice, so I get other outputs besides the correct one ("hello"). I know why this happens, but I can't find a fix for it.
import json

val = open("data.json")
val1 = json.load(val)  # loads the list
a = input("Enter a Jumbled word ")  # takes a word from the user
a = list(a)  # changes it into a list to iterate
for x in val1:  # iterate over words from the list
    for somethin in a:  # iterate over letters of the input
        if somethin in list(x):  # check if the letter is in the iterated word
            continue
        else:
            break
    else:  # the inner loop ended without a break, i.e. the word has all the letters
        if len(a) != len(list(x)):  # check if it has the same number of letters
            continue
        else:
            print(x)  # keep looping to see if there are more matches
EDIT: many people wanted the JSON file, so here it is:
['Torres Strait Creole', 'good bye', 'agon', "queen's guard", 'animosity', 'price list', 'subjective', 'means', 'severe', 'knockout', 'life-threatening', 'entry into the war', 'dominion', 'damnify', 'packsaddle', 'hallucinate', 'lumpy', 'inception', 'Blankenese', 'cacophonous', 'zeptomole', 'floccinaucinihilipilificate', 'abashed', 'abacterial', 'ableism', 'invade', 'cohabitant', 'handicapped', 'obelus', 'triathlon', 'habitue', 'instigate', 'Gladstone Gander', 'Linked Data', 'seeded player', 'mozzarella', 'gymnast', 'gravitational force', 'Friedelehe', 'open up', 'bundt cake', 'riffraff', 'resourceful', 'wheedle', 'city center', 'gorgonzola', 'oaf', 'auf', 'oafs', 'galoot', 'imbecile', 'lout', 'moron', 'news leak', 'crate', 'aggregator', 'cheating', 'negative growth', 'zero growth', 'defer', 'ride back', 'drive back', 'start back', 'shy back', 'spring back', 'shrink back', 'shy away', 'abderian', 'unable', 'font manager', 'font management software', 'consortium', 'gown', 'inject', 'ISO 639', 'look up', 'cross-eyed', 'squinting', 'health club', 'fitness facility', 'steer', 'sunbathe', 'combatives', 'HTH', 'hope that helps', 'How The Hell', 'distributed', 'plum cake', 'liberalization', 'macchiato', 'caffè macchiato', 'beach volley', 'exult', 'jubilate', 'beach volleyball', 'be beached', 'affogato', 'gigabyte', 'terabyte', 'petabyte', 'undressed', 'decameter', 'sensual', 'boundary marker', 'poor man', 'cohabitee', 'night sleep', 'protruding ears', 'three quarters of an hour', 'spermophilus', 'spermophilus stricto sensu', "devil's advocate", 'sacred king', 'sacral king', 'myr', 'million years', 'obtuse-angled', 'inconsolable', 'neurotic', 'humiliating', 'mortifying', 'theological', 'rematch', 'varıety', 'be short', 'ontological', 'taxonomic', 'taxonomical', 'toxicology testing', 'on the job training', 'boulder', 'unattackable', 'inviolable', 'resinous', 'resiny', 'ionizing radiation', 'citrus grove', 'comic book shop', 'preparatory measure', 'written account', 'brittle', 'locker', 'baozi', 'bao', 'bau', 'humbow', 'nunu', 'bausak', 'pow', 'pau', 'yesteryear', 'fire drill', 'rotted', 'putto', 'overthrow', 'ankle monitor', 'somewhat stupid', 'a little stupid', 'semordnilap', 'pangram', 'emordnilap', 'person with a sunlamp tan', 'tittle', 'incompatible', 'autumn wind', 'dairyman', 'chesty', 'lacustrine', 'chronophotograph', 'chronophoto', 'leg lace', 'ankle lace', 'ankle lock', 'Babelfy', 'ventricular', 'recurrent', 'long-lasting', 'long-standing', 'long standing', 'sea bass', 'reap', 'break wind', 'chase away', 'spark', 'speckle', 'take back', 'Westphalian', 'Aeolic Greek', 'startup', 'abseiling', 'impure', 'bottle cork', 'paralympic', 'work out', 'might', 'ice-cream man', 'ice cream man', 'ice cream maker', 'ice-cream maker', 'traveling', 'special delivery', 'prizefighter', 'abs', 'ab', 'churro', 'pilfer', 'dehumanize', 'fertilize', 'inseminate', 'digitalize', 'fluke', 'stroke of luck', 'decontaminate', 'abandonware', 'manzanita', 'tule', 'jackrabbit', 'system administrator', 'system admin', 'springtime lethargy', 'Palatinean', 'organized religion', 'bearing puller', 'wheel puller', 'gear puller', 'shot', 'normalize', 'palindromic', 'lancet window', 'terminological', 'back of head', 'dragon food', 'barbel', 'Central American Spanish', 'basis', 'birthmark', 'blood vessel', 'ribes', 'dog-rose', 'dreadful', 'freckle', 'free of charge', 'weather verb', 'weather sentence', 'gipsy', 'gypsy', 'glutton', 'hump', 'low voice', 'meek', 'moist', 'river mouth', 'turbid', 'multitude', 'palate', 'peak of mountain', 'poetry', 'pure', 'scanty', 'spicy', 'spicey', 'spruce', 'surface', 'infected', 'copulate', 'dilute', 'dislocate', 'grow up', 'hew', 'hinder', 'infringe', 'inhabit', 'marry off', 'offend', 'pass by', 'brother of a man', 'brother of a woman', 'sister of a man', 'sister of a woman', 'agricultural farm', 'result in', 'rebel', 'strew', 'scatter', 'sway', 'tread', 'tremble', 'hog', 'circuit breaker', 'Southern Quechua', 'safety pin', 'baby pin', 'college student', 'university student', 'pinus sibirica', 'Siberian pine', 'have lunch', 'floppy', 'slack', 'sloppy', 'wishi-washi', 'turn around', 'bogeyman', 'selfish', 'Talossan', 'biomembrane', 'biological membrane', 'self-sufficiency', 'underevaluation', 'underestimation', 'opisthenar', 'prosody', 'Kumhar Bhag Paharia', 'psychoneurotic', 'psychoneurosis', 'levant', "couldn't-care-less attitude", 'noctambule', 'acid-free paper', 'decontaminant', 'woven', 'wheaten', 'waste-ridden', 'war-ridden', 'violence-ridden', 'unwritten', 'typewritten', 'spoken', 'abiogenetically', 'rasp', 'abstractly', 'cyclically', 'acyclically', 'acyclic', 'ad hoc', 'spare tire', 'spare wheel', 'spare tyre', 'prefabricated', 'ISO 9000', 'Barquisimeto', 'Maracay', 'Ciudad Guayana', 'San Cristobal', 'Barranquilla', 'Arequipa', 'Trujillo', 'Cusco', 'Callao', 'Cochabamba', 'Goiânia', 'Campinas', 'Fortaleza', 'Florianópolis', 'Rosario', 'Mendoza', 'Bariloche', 'temporality', 'papyrus sedge', 'paper reed', 'Indian matting plant', 'Nile grass', 'softly softly', 'abductive reasoning', 'abductive inference', 'retroduction', 'Salzburgian', 'cymotrichous', 'access point', 'wireless access point', 'dynamic DNS', 'IP address', 'electrolyte', 'helical', 'hydrometer', 'intranet', 'jumper', 'MAC address', 'Media Access Control address', 'nickel–cadmium battery', 'Ni-Cd battery', 'oscillograph', 'overload', 'photovoltaic', 'photovoltaic cell', 'refractor telescope', 'autosome', 'bacterial artificial chromosome', 'plasmid', 'nucleobase', 'base pair', 'base sequence', 'chromosomal deletion', 'deletion', 'deletion mutation', 'gene deletion', 'chromosomal inversion', 'comparative genomics', 'genomics', 'cytogenetics', 'DNA replication', 'DNA repair', 'DNA sequence', 'electrophoresis', 'functional genomics', 'retroviral', 'retroviral infection', 'acceptance criteria', 'batch processing', 'business rule', 'code review', 'configuration management', 'entity–relationship model', 'lifecycle', 'object code', 'prototyping', 'pseudocode', 'referential', 'reusability', 'self-join', 'timestamp', 'accredited', 'accredited translator', 'certify', 'certified translation', 'computer-aided design', 'computer-aided', 'computer-assisted', 'management system', 'computer-aided translation', 'computer-assisted translation', 'machine-aided translation', 'conference interpreter', 'freelance translator', 'literal translation', 'mother-tongue', 'whispered interpreting', 'simultaneous interpreting', 'simultaneous interpretation', 'base anhydride', 'binary compound', 'absorber', 'absorption coefficient', 'attenuation coefficient', 'active solar heater', 'ampacity', 'amorphous semiconductor', 'amorphous silicon', 'flowerpot', 'antireflection coating', 'antireflection', 'armored cable', 'electric arc', 'breakdown voltage', 'casing', 'facing', 'lining', 'assumption of Mary', 'auscultation']
This is just an excerpt; the list is full of items like these.
As I understand it you are trying to identify all possible matches for the jumbled string in your list. You could sort the letters in the jumbled word and match the resulting list against sorted lists of the words in your data file.
sorted_jumbled_word = sorted(a)
for word in val1:
    if len(sorted_jumbled_word) == len(word) and sorted(word) == sorted_jumbled_word:
        print(word)
Checking by length first reduces unnecessary sorting. If doing this repeatedly, you might want to create a dictionary of the words in the data file with their sorted versions, to avoid having to repeatedly sort them.
There are spaces and punctuation in some of the terms in your word list. If you want to make the comparison ignoring spaces then remove them from both the jumbled word and the list of unjumbled words, using e.g. word = word.replace(" ", "")
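For repeated lookups you could precompute that dictionary once (a sketch, assuming val1 is the loaded word list; unjumble is a hypothetical helper):
from collections import defaultdict

# build once: map each word's sorted-letter signature to the words having it
anagram_index = defaultdict(list)
for word in val1:
    anagram_index[tuple(sorted(word))].append(word)

def unjumble(jumbled):
    # each lookup is then a single dictionary access
    return anagram_index.get(tuple(sorted(jumbled)), [])

print(unjumble("helol"))  # e.g. ['hello'] when that word is in data.json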

Find the anagram pairs from 2 lists and create a list of tuples of the anagrams

Say I have two lists:
list_A = [ 'Tar', 'Arc', 'Elbow', 'State', 'Cider', 'Dusty', 'Night', 'Inch', 'Brag', 'Cat', 'Bored', 'Save', 'Angel','bla', 'Stressed', 'Dormitory', 'School master','Awesoame', 'Conversation', 'Listen', 'Astronomer', 'The eyes', 'A gentleman', 'Funeral', 'The Morse Code', 'Eleven plus two', 'Slot machines', 'Fourth of July', 'Jim Morrison', 'Damon Albarn', 'George Bush', 'Clint Eastwood', 'Ronald Reagan', 'Elvis', 'Madonna Louise Ciccone', 'Bart', 'Paris', 'San Diego', 'Denver', 'Las Vegas', 'Statue of Liberty']
and
list_B = ['Cried', 'He bugs Gore', 'They see', 'Lives', 'Joyful Fourth', 'The classroom', 'Diagnose', 'Silent', 'Taste', 'Car', 'Act', 'Nerved', 'Thing', 'A darn long era', 'Brat', 'Twelve plus one', 'Elegant man', 'Below', 'Robed', 'Study', 'Voices rant on', 'Chin', 'Here come dots', 'Real fun', 'Pairs', 'Desserts', 'Moon starer', 'Dan Abnormal', 'Old West action', 'Built to stay free', 'One cool dance musician', 'Dirty room', 'Grab', 'Salvages', 'Cash lost in me', "Mr. Mojo Risin'", 'Glean', 'Rat', 'Vase']
What I am looking for is to find the anagram pairs of list_A in list_B and create a list of tuples of the anagrams.
For one list I can do the following and generate the list of tuples; however, for two lists I need some assistance. Thanks in advance for the help!
What I have tried for one list,
from collections import defaultdict

anagrams = defaultdict(list)
for w in list_A:
    anagrams[tuple(sorted(w))].append(w)
You can use a nested for loop, outer for the first list, inner for the second (also, use str.lower to make it case-insensitive):
anagram_pairs = []  # (w_1 from list_A, w_2 from list_B)
for w_1 in list_A:
    for w_2 in list_B:
        if sorted(w_1.lower()) == sorted(w_2.lower()):
            anagram_pairs.append((w_1, w_2))
print(anagram_pairs)
Output:
[('Tar', 'Rat'), ('Arc', 'Car'), ('Elbow', 'Below'), ('State', 'Taste'), ('Cider', 'Cried'), ('Dusty', 'Study'), ('Night', 'Thing'), ('Inch', 'Chin'), ('Brag', 'Grab'), ('Cat', 'Act'), ('Bored', 'Robed'), ('Save', 'Vase'), ('Angel', 'Glean'), ('Stressed', 'Desserts'), ('School master', 'The classroom'), ('Listen', 'Silent'), ('The eyes', 'They see'), ('A gentleman', 'Elegant man'), ('The Morse Code', 'Here come dots'), ('Eleven plus two', 'Twelve plus one'), ('Damon Albarn', 'Dan Abnormal'), ('Elvis', 'Lives'), ('Bart', 'Brat'), ('Paris', 'Pairs'), ('Denver', 'Nerved')]
You are quite close with your current attempt. All you need to do is repeat the same process on list_B:
from collections import defaultdict
anagrams = defaultdict(list)
list_A = [ 'Tar', 'Arc', 'Elbow', 'State', 'Cider', 'Dusty', 'Night', 'Inch', 'Brag', 'Cat', 'Bored', 'Save', 'Angel','bla', 'Stressed', 'Dormitory', 'School master','Awesoame', 'Conversation', 'Listen', 'Astronomer', 'The eyes', 'A gentleman', 'Funeral', 'The Morse Code', 'Eleven plus two', 'Slot machines', 'Fourth of July', 'Jim Morrison', 'Damon Albarn', 'George Bush', 'Clint Eastwood', 'Ronald Reagan', 'Elvis', 'Madonna Louise Ciccone', 'Bart', 'Paris', 'San Diego', 'Denver', 'Las Vegas', 'Statue of Liberty']
list_B = ['Cried', 'He bugs Gore', 'They see', 'Lives', 'Joyful Fourth', 'The classroom', 'Diagnose', 'Silent', 'Taste', 'Car', 'Act', 'Nerved', 'Thing', 'A darn long era', 'Brat', 'Twelve plus one', 'Elegant man', 'Below', 'Robed', 'Study', 'Voices rant on', 'Chin', 'Here come dots', 'Real fun', 'Pairs', 'Desserts', 'Moon starer', 'Dan Abnormal', 'Old West action', 'Built to stay free', 'One cool dance musician', 'Dirty room', 'Grab', 'Salvages', 'Cash lost in me', "Mr. Mojo Risin'", 'Glean', 'Rat', 'Vase']
for w in list_A:
    anagrams[tuple(sorted(w))].append(w)
for w in list_B:
    anagrams[tuple(sorted(w))].append(w)
result = [b for b in anagrams.values() if len(b) > 1]
Output:
[['Cider', 'Cried'], ['The eyes', 'They see'], ['Damon Albarn', 'Dan Abnormal'], ['Bart', 'Brat'], ['Paris', 'Pairs']]
Another solution, using a dictionary:
out = {}
for word in list_A:
    out.setdefault(tuple(sorted(word.lower())), []).append(word)
for word in list_B:
    word_s = tuple(sorted(word.lower()))
    if word_s in out:
        out[word_s].append(word)

print(list(tuple(v) for v in out.values() if len(v) > 1))
Prints:
[
("Tar", "Rat"),
("Arc", "Car"),
("Elbow", "Below"),
("State", "Taste"),
("Cider", "Cried"),
("Dusty", "Study"),
("Night", "Thing"),
("Inch", "Chin"),
("Brag", "Grab"),
("Cat", "Act"),
("Bored", "Robed"),
("Save", "Vase"),
("Angel", "Glean"),
("Stressed", "Desserts"),
("School master", "The classroom"),
("Listen", "Silent"),
("The eyes", "They see"),
("A gentleman", "Elegant man"),
("The Morse Code", "Here come dots"),
("Eleven plus two", "Twelve plus one"),
("Damon Albarn", "Dan Abnormal"),
("Elvis", "Lives"),
("Bart", "Brat"),
("Paris", "Pairs"),
("Denver", "Nerved"),
]

Capital city that starts with "a" and ends with "a". It doesn't matter if the letter "a" is uppercase or lowercase

I have been trying to output capital cities that start and end with the letter "a". It doesn't matter if they start with a capital "A".
capitals = ('Kabul', 'Tirana (Tirane)', 'Algiers', 'Andorra la Vella', 'Luanda', "Saint John's", 'Buenos Aires', 'Yerevan', 'Canberra', 'Vienna', 'Baku', 'Nassau', 'Manama', 'Dhaka', 'Bridgetown', 'Minsk', 'Brussels', 'Belmopan', 'Porto Novo', 'Thimphu', 'Sucre', 'Sarajevo', 'Gaborone', 'Brasilia', 'Bandar Seri Begawan', 'Sofia', 'Ouagadougou', 'Gitega', 'Phnom Penh', 'Yaounde', 'Ottawa', 'Praia', 'Bangui', "N'Djamena", 'Santiago', 'Beijing', 'Bogota', 'Moroni', 'Kinshasa', 'Brazzaville', 'San Jose', 'Yamoussoukro', 'Zagreb', 'Havana', 'Nicosia', 'Prague', 'Copenhagen', 'Djibouti', 'Roseau', 'Santo Domingo', 'Dili', 'Quito', 'Cairo', 'San Salvador', 'London', 'Malabo', 'Asmara', 'Tallinn', 'Mbabana', 'Addis Ababa', 'Palikir', 'Suva', 'Helsinki', 'Paris', 'Libreville', 'Banjul', 'Tbilisi', 'Berlin', 'Accra', 'Athens', "Saint George's", 'Guatemala City', 'Conakry', 'Bissau', 'Georgetown', 'Port au Prince', 'Tegucigalpa', 'Budapest', 'Reykjavik', 'New Delhi', 'Jakarta', 'Tehran', 'Baghdad', 'Dublin', 'Jerusalem', 'Rome', 'Kingston', 'Tokyo', 'Amman', 'Nur-Sultan', 'Nairobi', 'Tarawa Atoll', 'Pristina', 'Kuwait City', 'Bishkek', 'Vientiane', 'Riga', 'Beirut', 'Maseru', 'Monrovia', 'Tripoli', 'Vaduz', 'Vilnius', 'Luxembourg', 'Antananarivo', 'Lilongwe', 'Kuala Lumpur', 'Male', 'Bamako', 'Valletta', 'Majuro', 'Nouakchott', 'Port Louis', 'Mexico City', 'Chisinau', 'Monaco', 'Ulaanbaatar', 'Podgorica', 'Rabat', 'Maputo', 'Nay Pyi Taw', 'Windhoek', 'No official capital', 'Kathmandu', 'Amsterdam', 'Wellington', 'Managua', 'Niamey', 'Abuja', 'Pyongyang', 'Skopje', 'Belfast', 'Oslo', 'Muscat', 'Islamabad', 'Melekeok', 'Panama City', 'Port Moresby', 'Asuncion', 'Lima', 'Manila', 'Warsaw', 'Lisbon', 'Doha', 'Bucharest', 'Moscow', 'Kigali', 'Basseterre', 'Castries', 'Kingstown', 'Apia', 'San Marino', 'Sao Tome', 'Riyadh', 'Edinburgh', 'Dakar', 'Belgrade', 'Victoria', 'Freetown', 'Singapore', 'Bratislava', 'Ljubljana', 'Honiara', 'Mogadishu', 'Pretoria, Bloemfontein, Cape Town', 'Seoul', 'Juba', 'Madrid', 'Colombo', 'Khartoum', 'Paramaribo', 'Stockholm', 'Bern', 'Damascus', 'Taipei', 'Dushanbe', 'Dodoma', 'Bangkok', 'Lome', "Nuku'alofa", 'Port of Spain', 'Tunis', 'Ankara', 'Ashgabat', 'Funafuti', 'Kampala', 'Kiev', 'Abu Dhabi', 'London', 'Washington D.C.', 'Montevideo', 'Tashkent', 'Port Vila', 'Vatican City', 'Caracas', 'Hanoi', 'Cardiff', "Sana'a", 'Lusaka', 'Harare')
This is my code:
for elem in capitals:
    elem = elem.lower()
    ["".join(j for j in i if j not in string.punctuation) for i in capitals]
    if len(elem) >= 4 and elem.endswith(elem[0]):
        print(elem)
My output is:
andorra la vella
saint john's
asmara
addis ababa
accra
saint george's
nur-sultan
abuja
oslo
warsaw
apia
ankara
tashkent
My expected output is:
andorra la vella
asmara
addis ababa
accra
abuja
apia
ankara
You didn't check if the capital starts with 'a'. I also assumed you want to filter out punctuation based on your code, so this is what I ended up with:
import string

for elem in capitals:
    elem = elem.lower()
    for punct in string.punctuation:
        elem = elem.replace(punct, '')
    if elem.startswith('a') and elem.endswith('a'):
        print(elem)
If punctuation does not matter, the check on its own is enough:
for elem in capitals:
    elem = elem.lower()
    if elem.startswith('a') and elem.endswith('a'):
        print(elem)
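The same filter as a list comprehension (a sketch; note it prints the names in their original case):
matches = [c for c in capitals if c.lower().startswith('a') and c.lower().endswith('a')]
print('\n'.join(matches))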

XPath - extracting table data with irregular pattern

Extending an existing question and answer here, I am trying to extract the player name and his position. The output would look like:
playername, position
EJ Manuel, Quarterbacks
Tyrod Taylor, Quarterbacks
Anthony Dixon, Running backs
...
This is what I have done so far:
import requests
from lxml import html

tree = html.fromstring(requests.get("https://en.wikipedia.org/wiki/List_of_current_AFC_team_rosters").text)
for h3 in tree.xpath("//table[@class='toccolours']//tr[2]"):
    position = h3.xpath(".//b/text()")
    players = h3.xpath(".//ul/li/a/text()")
    print(position, players)
The above code delivers the following, but not in the format I need:
(['Quarterbacks', 'Running backs', 'Wide receivers', 'Tight ends', 'Offensive linemen', 'Defensive linemen', 'Linebackers', 'Defensive backs', 'Special teams', 'Reserve lists', 'Unrestricted FAs', 'Restricted FAs', 'Exclusive-Rights FAs'], ['EJ Manuel', 'Tyrod Taylor', 'Anthony Dixon', 'Jerome Felton', 'Mike Gillislee', 'LeSean McCoy', 'Karlos Williams', 'Leonard Hankerson', 'Marcus Easley', 'Marquise Goodwin', 'Percy Harvin', 'Dez Lewis', 'Walt Powell', 'Greg Salas', 'Sammy Watkins', 'Robert Woods', 'Charles Clay', 'Chris Gragg', "Nick O'Leary", 'Tyson Chandler', 'Ryan Groy', 'Seantrel Henderson', 'Cyrus Kouandjio', 'John Miller', 'Kraig Urbik', 'Eric Wood', 'T. J. Barnes', 'Marcell Dareus', 'Lavar Edwards', 'IK Enemkpali', 'Jerry Hughes', 'Kyle Williams', 'Mario Williams', 'Jerel Worthy', 'Jarius Wynn', 'Preston Brown', 'Randell Johnson', 'Manny Lawson', 'Kevin Reddick', 'Tony Steward', 'A. J. Tarpley', 'Max Valles', 'Mario Butler', 'Ronald Darby', 'Stephon Gilmore', 'Corey Graham', 'Leodis McKelvin', 'Jonathan Meeks', 'Merrill Noel', 'Nickell Robey', 'Sammy Seamster', 'Cam Thomas', 'Aaron Williams', 'Duke Williams', 'Dan Carpenter', 'Jordan Gay', 'Garrison Sanborn', 'Colton Schmidt', 'Blake Annen', 'Jarrett Boykin', 'Jonathan Dowling', 'Greg Little', 'Jacob Maxwell', 'Ronald Patrick', 'Cedric Reed', 'Cyril Richardson', 'Phillip Thomas', 'James Wilder, Jr.', 'Nigel Bradham', 'Ron Brooks', 'Alex Carrington', 'Cordy Glenn', 'Leonard Hankerson', 'Richie Incognito', 'Josh Johnson', 'Corbin Bryant', 'Stefan Charles', 'MarQueis Gray', 'Chris Hogan', 'Jordan Mills', 'Ty Powell', 'Bacarri Rambo', 'Cierre Wood'])
(['Quarterbacks', 'Running backs', 'Wide receivers', 'Tight ends', 'Offensive linemen', 'Defensive linemen', 'Linebackers', 'Defensive backs', 'Special teams', 'Reserve lists', 'Unrestricted FAs', 'Restricted FAs', 'Exclusive-Rights FAs'], ['Zac Dysert', 'Ryan Tannehill', 'Logan Thomas', 'Jay Ajayi', 'Jahwan Edwards', 'Damien Williams', 'Tyler Davis', 'Robert Herron', 'Greg Jennings', 'Jarvis Landry', 'DeVante Parker', 'Kenny Stills', 'Jordan Cameron', 'Dominique Jones', 'Dion Sims', 'Branden Albert', 'Jamil Douglas', "Ja'Wuan James", 'Vinston Painter', 'Mike Pouncey', 'Anthony Steen', 'Dallas Thomas', 'Billy Turner', 'Deandre Coleman', 'Quinton Coples', 'Terrence Fede', 'Dion Jordan', 'Earl Mitchell', 'Damontre Moore', 'Jordan Phillips', 'Ndamukong Suh', 'Charles Tuaau', 'Robert Thomas', 'Cameron Wake', 'Julius Warmsley', 'Jordan Williams', 'Neville Hewitt', 'Mike Hull', 'Jelani Jenkins', 'Terrell Manning', 'Chris McCain', 'Koa Misi', 'Zach Vigil', 'Walt Aikens', 'Damarr Aultman', 'Brent Grimes', 'Reshad Jones', 'Tony Lippett', 'Bobby McCain', 'Brice McCain', 'Tyler Patmon', 'Dax Swanson', 'Jamar Taylor', 'Matt Darr', 'John Denney', 'Andrew Franks', 'Louis Delmas', 'James-Michael Johnson', 'Rishard Matthews', 'Jacques McClendon', 'Lamar Miller', 'Matt Moore', 'Spencer Paysinger', 'Derrick Shelby', 'Kelvin Sheppard', 'Shelley Smith', 'Olivier Vernon', 'Michael Thomas', 'Brandon Williams', 'Shamiel Gary', 'Matt Hazel', 'Ulrick John', 'Jake Stoneburner'])
...
Any suggestions?
You can use a nested loop for this task: first loop through the positions and then, for each position, loop through the corresponding players:
#loop through positions
for b in tree.xpath("//table[@class='toccolours']//tr[2]//b"):
    #get current position text
    position = b.xpath("text()")[0]
    #get players that correspond to the current position
    for a in b.xpath("following::ul[1]/li/a[not(*)]"):
        #get current player text
        player = a.xpath("text()")[0]
        #print current position and player together
        print(position, player)
Last part of the output:
.....
('Reserve lists', 'Chris Watt')
('Reserve lists', 'Eric Weddle')
('Reserve lists', 'Tourek Williams')
('Practice squad', 'Alex Bayer')
('Practice squad', 'Isaiah Burse')
('Practice squad', 'Richard Crawford')
('Practice squad', 'Ben Gardner')
('Practice squad', 'Michael Huey')
('Practice squad', 'Keith Lewis')
('Practice squad', 'Chuka Ndulue')
('Practice squad', 'Tim Semisch')
('Practice squad', 'Brad Sorensen')
('Practice squad', 'Craig Watts')
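
To get the exact playername, position rows the question asks for, you could collect the pairs and write them out (a sketch, assuming tree as above; players.csv is a hypothetical output file):
import csv

rows = []
for b in tree.xpath("//table[@class='toccolours']//tr[2]//b"):
    position = b.xpath("text()")[0]
    for a in b.xpath("following::ul[1]/li/a[not(*)]"):
        rows.append((a.xpath("text()")[0], position))

with open("players.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["playername", "position"])
    writer.writerows(rows)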
