Python DataFrame column with list of strings does not flatten - python

I have a column in a DataFrame (production_company) which has a list of strings that are production companies for a movie. I want to search for all unique occurrences of a production company across all movies.
In the data below I have given a sample of the column values in production_company.
"['Universal Studios', 'Amblin Entertainment', 'Legendary Pictures', 'Fuji Television Network', 'Dentsu']"
"['Village Roadshow Pictures', 'Kennedy Miller Productions']"
"['Summit Entertainment', 'Mandeville Films', 'Red Wagon Entertainment', 'NeoReel']"
"['Lucasfilm', 'Truenorth Productions', 'Bad Robot']"
"['Universal Pictures', 'Original Film', 'Media Rights Capital', 'Dentsu', 'One Race Films']"
"['Regency Enterprises', 'Appian Way', 'CatchPlay', 'Anonymous Content', 'New Regency Pictures']"
I am trying to first flatten the column using the solution given in "Pandas Series of lists to one series", but I get the error TypeError: 'float' object is not iterable:
17 slist =[]
18 for company in production_companies:
---> 19 slist.extend(company )
20
21
TypeError: 'float' object is not iterable
production_companies holds the column df['production_company']
Company is a list so why is it taking it as float? Even list comprehension gives the same error: flattened_list = [y for x in production_companies for y in x]

You can use collections.Counter to count items. I would split the task into 3 steps:
1. Convert the series of strings into a series of lists via ast.literal_eval.
2. Use itertools.chain to form an iterable of companies and feed it to Counter.
3. Use a list comprehension to filter for companies with a count of 1.
Here's a demo:
from ast import literal_eval
from itertools import chain
from collections import Counter
s = df['production_company'].map(literal_eval)
c = Counter(chain.from_iterable(s))
c_filtered = [k for k, v in c.items() if v == 1]
Result:
print(c_filtered)
['Village Roadshow Pictures', 'Kennedy Miller Productions',
...
'Truenorth Productions', 'Regency Enterprises']
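One likely cause of the TypeError in the question, offered as an assumption since the full data is not shown: missing values in a pandas column are stored as NaN, which is a float, so iterating over such a row fails. The sketch below uses a small made-up frame (its three rows are illustrative, not the asker's data), drops the missing rows before parsing, and then collects both every distinct company and the companies seen only once.

```python
from ast import literal_eval
from collections import Counter
from itertools import chain

import pandas as pd

# Illustrative data: two stringified lists and one missing value (NaN is a float).
df = pd.DataFrame({
    "production_company": [
        "['Universal Pictures', 'Dentsu']",
        float("nan"),
        "['Lucasfilm', 'Dentsu']",
    ]
})

# Drop missing rows first, then parse each string into a real list.
parsed = df["production_company"].dropna().map(literal_eval)

# Count every company across all rows.
counts = Counter(chain.from_iterable(parsed))

all_companies = sorted(counts)                               # every distinct company
seen_once = sorted(k for k, v in counts.items() if v == 1)   # companies with a count of 1

print(all_companies)
print(seen_once)
```

Whether "unique" should mean every distinct company or only those that occur once depends on the goal, so both results are kept here.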

Related

Returns a dataframe with a list of files containing the word

I have a dataframe:
business049.txt [bmw, cash, fuel, mini, product, less, mini]
business470.txt [saudi, investor, pick, savoy, london, famou]
business075.txt [eu, minist, mull, jet, fuel, tax, european]
business101.txt [australia, rate, australia, rais, benchmark]
business060.txt [insur, boss, plead, guilti, anoth, us, insur]
Therefore, I would like the output to include a column of words and a column of filenames that contain it. It should be like:
bmw [business049.txt,business055.txt]
australia [business101.txt,business141.txt]
Thank you
This is quite possibly not the most efficient/best way to do this, but here you go:
import pandas as pd
# Create DataFrame from question
df = pd.DataFrame({
'txt_file': ['business049.txt',
'business470.txt',
'business075.txt',
'business101.txt',
'business060.txt',
],
'words': [
['bmw', 'cash', 'fuel', 'mini', 'product', 'less', 'mini'],
['saudi', 'investor', 'pick', 'savoy', 'london', 'famou'],
['eu', 'minist', 'mull', 'jet', 'fuel', 'tax', 'european'],
['australia', 'rate', 'australia', 'rais', 'benchmark'],
['insur', 'boss', 'plead', 'guilti', 'anoth', 'us', 'insur'],
]
})
# Get all unique words in a list
word_list = list(set(df['words'].explode()))
# Link txt files to unique words
# Note: list of txt files is one string comma separated to ensure single column in resulting DataFrame
word_dict = {
unique_word: [', '.join(df[df['words'].apply(lambda list_of_words: unique_word in list_of_words)]['txt_file'])] for unique_word in word_list
}
# Create DataFrame from dictionary (transpose to have words as row index).
words_in_files = pd.DataFrame(word_dict).transpose()
The dictionary word_dict might already be exactly what you need instead of holding on to a DataFrame just for the sake of using a DataFrame. If that is the case, remove the ', '.join() part from the dictionary creation, because it doesn't matter that the values of your dict are unequal in length.
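If a plain dictionary is enough, it can also be built in a single pass without pandas. This is a minimal sketch of that idea, not the answer's own code; the names words_by_file and files_by_word are made up for illustration, and only two of the question's rows are used.

```python
from collections import defaultdict

# Two rows from the question, keyed by file name (illustrative subset).
words_by_file = {
    "business049.txt": ["bmw", "cash", "fuel", "mini", "product", "less", "mini"],
    "business075.txt": ["eu", "minist", "mull", "jet", "fuel", "tax", "european"],
}

# Map each word to the list of files that contain it.
files_by_word = defaultdict(list)
for txt_file, words in words_by_file.items():
    # dict.fromkeys removes repeated words while keeping their order,
    # so a file is listed only once per word.
    for word in dict.fromkeys(words):
        files_by_word[word].append(txt_file)

print(files_by_word["fuel"])
print(files_by_word["mini"])
```

Each file is scanned once, instead of filtering the whole DataFrame once per unique word.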

Get All Row Values After Split and Put Them In List

UPDATED: I've the following DataFrame:
df = pd.DataFrame({'sports': ["['soccer', 'men tennis']", "['soccer']", "['baseball', 'women tennis']"]})
print(df)
sports
0 ['soccer', 'men tennis']
1 ['soccer']
2 ['baseball', 'women tennis']
I need to extract all the unique sport names and put them into a list. I'm trying the following code:
out = pd.DataFrame(df['sports'].str.split(',').tolist()).stack()
out.value_counts().index
However, it's returning NaN values.
Desired output:
['soccer', 'men tennis', 'baseball', 'women tennis']
What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!
If these are lists, then you could explode + unique:
out = df['sports'].explode().unique().tolist()
If these are strings, then you could use ast.literal_eval first to parse it:
import ast
out = df['sports'].apply(ast.literal_eval).explode().unique().tolist()
or use ast.literal_eval in a set comprehension and unpack:
out = [*{x for lst in df['sports'].tolist() for x in ast.literal_eval(lst)}]
Output:
['soccer', 'men tennis', 'baseball', 'women tennis']
Assuming the values stored in the sports column are lists, we can flatten the column using np.hstack, then use set to get the unique values:
import numpy as np
set(np.hstack(df['sports']))
{'baseball', 'men tennis', 'soccer', 'women tennis'}
lst = []
df['sports'].apply(lambda x: [lst.append(element) for element in x])
lst = list(set(lst))
Not sure how efficient this is, but it works.
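The set-based variants above do not guarantee any particular order, while the desired output lists the sports in order of first appearance. A small sketch of one way to keep that order, using only the standard library and assuming the values are the stringified lists from the question (raw and unique_sports are illustrative names):

```python
import ast
from itertools import chain

# The column values from the question, as plain strings.
raw = ["['soccer', 'men tennis']", "['soccer']", "['baseball', 'women tennis']"]

# Parse each string into a list, then flatten all lists into one stream.
parsed = map(ast.literal_eval, raw)

# dict.fromkeys keeps the first occurrence of each item and preserves order.
unique_sports = list(dict.fromkeys(chain.from_iterable(parsed)))

print(unique_sports)
```

With a DataFrame, df['sports'].tolist() could be passed in place of raw.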

How to extract list of dictionaries from Pandas column

I have the following dataframe that I extracted from an API. One of its columns holds the data I need, but that data is structured as a list of dictionaries:
I could get the data that I care about from that dictionary using this chunk of code:
for k,v in d.items():
    for i,j in v.items():
        if isinstance(j, list):
            for l in range(len(j)):
                for k in j[l]:
                    print(j[l])
I get a structure like this one, so I'd need to get each of the 'values' entries inside the list of dictionaries and then organize them in a dataframe, like, for example, the first item in the list of dictionaries:
Once I get to the point of getting the above structure, how could I make a dataframe like the one in the image?
Raw data:
data = {'rows': [{'values': ['Tesla Inc (TSLA)', '$1056.78', '$1199.78', '13.53%'], 'children': []}, {'values': ['Taiwan Semiconductor Manufacturing Company Limited (TSM)', '$120.31', '$128.80', '7.06%'], 'children': []}]}
You can use pandas. First cast your data to pd.DataFrame, then use apply(pd.Series) to expand lists inside 'values' column to separate columns and set_axis method to change column names:
import pandas as pd
data = {'rows': [{'values': ['Tesla Inc (TSLA)', '$1056.78', '$1199.78', '13.53%'], 'children': []}, {'values': ['Taiwan Semiconductor Manufacturing Company Limited (TSM)', '$120.31', '$128.80', '7.06%'], 'children': []}]}
out = pd.DataFrame(data['rows'])['values'].apply(pd.Series).set_axis(['name','price','price_n','pct'], axis=1)
Output:
                                                name     price   price_n     pct
0                                   Tesla Inc (TSLA)  $1056.78  $1199.78  13.53%
1  Taiwan Semiconductor Manufacturing Company Lim...   $120.31   $128.80   7.06%
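As a possible alternative, the 'values' lists can be collected first and handed to the DataFrame constructor in one go, which avoids calling pd.Series once per row. This is a sketch under the assumption that every 'values' list has exactly four entries; the column names are the ones chosen in the answer above.

```python
import pandas as pd

data = {"rows": [
    {"values": ["Tesla Inc (TSLA)", "$1056.78", "$1199.78", "13.53%"], "children": []},
    {"values": ["Taiwan Semiconductor Manufacturing Company Limited (TSM)",
                "$120.31", "$128.80", "7.06%"], "children": []},
]}

columns = ["name", "price", "price_n", "pct"]

# Each 'values' list becomes one row; the column names are applied directly.
out = pd.DataFrame([row["values"] for row in data["rows"]], columns=columns)

print(out)
```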

Can we refer to a dictionary to get a value from its key while replacing in Python?

I have a flat file with terms and sentences. If any term is found in the sentence, I need to append its id to the term (term|id). The pattern match should be case insensitive, and the matched text must keep the same case as in the sentence. Is it possible to refer to a dictionary to get the value using its key in a replace call?
from pandas import DataFrame
import re
df = {'id':[11,12,13,14,15,16],
'term': ['Ford', 'EXpensive', 'TOYOTA', 'Mercedes Benz', 'electric', 'cars'],
'sentence': ['F-FORD FORD/FORD is less expensive than Mercedes Benz.' ,'toyota, hyundai mileage is good compared to ford','tesla is an electric-car','toyota too has electric cars','CARS','CArs are expensive.']
}
#Dataframe creation
df = DataFrame(df,columns= ['id','term','sentence'])
#Dictionary creation
dict = {}
l_term = list(df['term'])
l_id = list(df['id'])
for i,j in zip(l_term,l_id):
    dict[str(i)] = j
#Building patterns to replace
pattern = r'(?i)(?<!-)(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(df["term"],key=len,reverse=True))))
#Replace
df["sentence"].replace(pattern, r"\g<0>|present", inplace=True, regex=True)
Instead of |present I need to refer to the dictionary, something like |dict.get(\g<0>). Or is there any other approach to achieve this? Also, if a term such as cars were listed twice (say, under ids 16 and 17), either id could be appended.
The expected outcome is
F-FORD FORD|11/FORD|11 is less expensive|12 than Mercedes Benz|14.
toyota|13, hyundai mileage is good compared to ford|11
tesla is an electric|15-car
toyota|13 too has electric|15 cars|16
CARS|16
CArs|16 are expensive|12.
You may use a slight modification of the current code:
from pandas import DataFrame
import re
df = {'id':[11,12,13,14,15,16],
'term': ['Ford', 'EXpensive', 'TOYOTA', 'Mercedes Benz', 'electric', 'cars'],
'sentence': ['F-FORD FORD/FORD is less expensive than Mercedes Benz.' ,'toyota, hyundai mileage is good compared to ford','tesla is an electric-car','toyota too has electric cars','CARS','CArs are expensive.']
}
#Dataframe creation
df = DataFrame(df,columns= ['id','term','sentence'])
#Dictionary creation
dct = {}
l_term = list(df['term'])
l_id = list(df['id'])
for i,j in zip(l_term,l_id):
    dct[str(i).upper()] = j
#Building patterns to replace
pattern = r'(?i)(?<!-)(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(df["term"],key=len,reverse=True))))
#Replace
df["sentence"] = df["sentence"].str.replace(pattern, lambda x: "{}|{}".format(x.group(), dct[x.group().upper()]), regex=True)
NOTES:
dict is the name of a built-in type, not a reserved word, but assigning to it shadows the built-in; use a name such as dct instead.
dct[str(i).upper()] = j stores each term under its uppercased form, which makes the dictionary lookup case insensitive.
The last line is the main one: Series.str.replace accepts a callable as the replacement argument when regex=True (recent pandas versions default to regex=False, so pass it explicitly). Each match is passed to the lambda as the Match object x; x.group() returns the matched text with its original case, and dct[x.group().upper()] retrieves the corresponding id.
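The same callable-replacement idea can be tried without pandas, since re.sub also accepts a function. The sketch below reuses the pattern from the answer on the first sentence only; term_ids, lookup, and tag are hypothetical names introduced for illustration.

```python
import re

# A subset of the terms and ids from the question.
term_ids = {"Ford": 11, "EXpensive": 12, "Mercedes Benz": 14}

# Uppercased keys make the lookup case insensitive.
lookup = {term.upper(): term_id for term, term_id in term_ids.items()}

# Longest terms first, so longer alternatives win over shorter ones.
alternation = "|".join(map(re.escape, sorted(term_ids, key=len, reverse=True)))
pattern = re.compile(r"(?i)(?<!-)(?<!\w)(?:{})(?!\w)".format(alternation))

def tag(match):
    # Keep the matched text as written and append the id found via the dictionary.
    return "{}|{}".format(match.group(), lookup[match.group().upper()])

sentence = "F-FORD FORD/FORD is less expensive than Mercedes Benz."
tagged = pattern.sub(tag, sentence)

print(tagged)
# F-FORD FORD|11/FORD|11 is less expensive|12 than Mercedes Benz|14.
```

The FORD in F-FORD is left alone because the (?<!-) lookbehind rejects a match that directly follows a hyphen.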

sorting and grouping multiple dictionaries in python

I have an unknown number of dictionaries each identified by a specific code.
All of the values are created dynamically, so the codes to group by are unknown.
I am hoping someone might be able to help me identify the best way to group the dictionaries so that I can then move through each to produce a table. There are actually about 7 items in the dictionary.
Each dictionary is a row in the table.
example:
results = ['GROUP1':{'name':'Sam', 'code':'CDZ', 'cat_name':'category1', 'cat_code':'GROUP1'}, 'GROUP1':{'name':'James', 'code':'CDF', 'cat_name':'category1', 'cat_code':'GROUP1'}, 'GROUP2':{'name':'Ellie', 'code':'CDT', 'cat_name':'category2', 'cat_code':'GROUP2'}]
I want to be able to format these dictionaries into a table like the following:
GROUP1 - category1
CODE | NAME
CDZ | Sam
CDF | James
GROUP2 - category2
CODE | NAME
CDT | Ellie
Thanks so much in advance.
If results is a list of dicts like this, you can group the rows with a defaultdict:
>>> results = [
... {'name': 'Sam', 'code': 'CDZ', 'cat_name': 'category1', 'cat_code': 'GROUP1'},
... {'name': 'James', 'code': 'CDF', 'cat_name': 'category1', 'cat_code': 'GROUP1'},
... {'name': 'Ellie', 'code': 'CDT', 'cat_name': 'category2', 'cat_code': 'GROUP2'}]
>>> from collections import defaultdict
>>> D = defaultdict(list)
>>> for item in results:
...     D[item['cat_code'], item['cat_name']].append((item['code'], item['name']))
...
>>> import pprint
>>> pprint.pprint(dict(D))
{('GROUP1', 'category1'): [('CDZ', 'Sam'), ('CDF', 'James')],
 ('GROUP2', 'category2'): [('CDT', 'Ellie')]}
You can iterate through D.items() and do whatever you like
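For instance, the table layout requested in the question could be produced from the grouped data roughly like this. It is a sketch that assumes results is the list of dicts shown above; groups plays the role of D, and lines and table are illustrative names.

```python
from collections import defaultdict

results = [
    {'name': 'Sam', 'code': 'CDZ', 'cat_name': 'category1', 'cat_code': 'GROUP1'},
    {'name': 'James', 'code': 'CDF', 'cat_name': 'category1', 'cat_code': 'GROUP1'},
    {'name': 'Ellie', 'code': 'CDT', 'cat_name': 'category2', 'cat_code': 'GROUP2'},
]

# Group (code, name) pairs under their (cat_code, cat_name) key.
groups = defaultdict(list)
for item in results:
    groups[item['cat_code'], item['cat_name']].append((item['code'], item['name']))

# Emit one header block per group, followed by its rows.
lines = []
for (cat_code, cat_name), rows in groups.items():
    lines.append(f"{cat_code} - {cat_name}")
    lines.append("CODE | NAME")
    lines.extend(f"{code} | {name}" for code, name in rows)

table = "\n".join(lines)
print(table)
```

Groups appear in the order their first row was seen, because dictionaries preserve insertion order.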
Your first two problems are that your example input is not valid Python (the dict literal syntax is {}, not []), and you cannot have two identical keys in one dict. Until you fix these, we can't do anything. I'm not sure how you'd want to fix the duplicate keys problem--maybe you actually want an array of dicts instead of a dict of dicts?
