The dataframe "df_bool" in the below code produces "False" where it should produce "True". I believe this is because because some of my data contains an integer that represents quantities greater than one. I am not sure what the best solution is here? Perhaps separating the value from the object? How would I do that?
My intent is to produce a table that tells me when an object is present in df['Favorite fruit'] for each row.
import pandas as pd

def boolean_df(item_lists, unique_items):
    # Create empty dict
    bool_dict = {}
    # Loop through all the tags
    for i, item in enumerate(unique_items):
        # Apply boolean mask
        bool_dict[item] = item_lists.apply(lambda x: item in x)
    # Return the results as a dataframe
    return pd.DataFrame(bool_dict)

item_list = {'Name': ["kim", "colby", "ryan", "carter"],
             'Favorite fruit': ["apple", ("orange", "grape"), ("3apple", "3grape"), ("2apple", "2orange")]}
df = pd.DataFrame(data=item_list)
unique_items = {"apple": "1", "orange": "2", "grape": "3"}
df_bool = boolean_df(df['Favorite fruit'], unique_items.keys())
I have tried separating the values from the objects with no success.
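One way to separate the quantity from the object (a sketch, assuming the quantity is always a leading run of digits, as in "3apple") would be to strip those digits with a regex before the membership test:

import pandas as pd
import re

def strip_quantity(cell):
    # Wrap a bare string in a tuple so single values and tuples are
    # handled alike, then drop any leading digits ("3apple" -> "apple")
    items = (cell,) if isinstance(cell, str) else cell
    return tuple(re.sub(r'^\d+', '', i) for i in items)

def boolean_df(item_lists, unique_items):
    # Same masking idea as above, but each cell is normalized first
    return pd.DataFrame({
        item: item_lists.apply(lambda x: item in strip_quantity(x))
        for item in unique_items
    })

With the sample data above, boolean_df(df['Favorite fruit'], unique_items.keys()) would then mark "apple" as True for ryan ("3apple") and carter ("2apple") as well.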
I'd like to use the to_json() function to serialize a pandas dataframe while encapsulating each row in a root 'Person' element.
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df.to_json(orient='records')
'[{"Name":"tom","Age":10},{"Name":"nick","Age":15},{"Name":"juli","Age":14}]'
I'd like the to_json() output to be:
'[{"Person":{"Name":"tom","Age":10}},{"Person":{"Name":"nick","Age":15}},{"Person":{"Name":"juli","Age":14}}]'
I'm thinking this can be achieved with dataframe.apply() but haven't been able to figure it out.
Thx.
Use a list comprehension to create a list of dicts from df.to_dict:
In [4370]: d = [{'Person':i} for i in df.to_dict(orient='records')]
Convert above dict to json using json.dumps:
In [4372]: import json
In [4373]: j = json.dumps(d)
In [4374]: print(j)
[{"Person": {"Name": "tom", "Age": 10}}, {"Person": {"Name": "nick", "Age": 15}}, {"Person": {"Name": "juli", "Age": 14}}]
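For reference, the two steps can also be collapsed into a single expression:

j = json.dumps([{'Person': rec} for rec in df.to_dict(orient='records')])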
I suppose you want to use Person as an index or identifier for each record; otherwise, including Person as a fixed string key in every nested dict would be redundant. If that is the case, you can pass 'index' as the orient argument, which keys each record by the dataframe's index.
>>> import pandas as pd
>>> data = [['tom', 10], ['nick', 15], ['juli', 14]]
>>> df = pd.DataFrame(data, columns=['Name', 'Age'])
>>> df.to_json(orient='index')
'{"0":{"Name":"tom","Age":10},"1":{"Name":"nick","Age":15},"2":{"Name":"juli","Age":14}}'
Also, you can adjust your index to whatever you want. I hope this is what you want.
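For example (a sketch of the same idea), setting Name as the index before serializing keys each record by name rather than by row number:

>>> df.set_index('Name').to_json(orient='index')
'{"tom":{"Age":10},"nick":{"Age":15},"juli":{"Age":14}}'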
How can I add the outputs of different for loops into one dataframe? For example, I have scraped data from a website and have lists of names, emails and phone numbers, each produced by its own loop. I want to add all of the outputs into a single dataframe as one table.
I am able to do it for one single loop, but not for multiple loops.
If I remove zip from the for loop, it gives the error "too many values to unpack".
Loop
phone = soup.find_all(class_="directory_item_phone directory_item_info_item")
for phn in phone:
    print(phn.text.strip())
##Output - List of Numbers
Code for df
df = list()
for name, mail, phn in zip(faculty_name, email, phone):
    df.append(name.text.strip())
    df.append(mail.text.strip())
    df.append(phn.text.strip())
df = pd.DataFrame(df)
df
An efficient way to create a pandas.DataFrame is to first create a dict and then convert it into a DataFrame.
In your case you could probably do:
import pandas as pd

D = {'name': [], 'mail': [], 'phone': []}
for name, mail, phn in zip(faculty_name, email, phone):
    D['name'].append(name.text.strip())
    D['mail'].append(mail.text.strip())
    D['phone'].append(phn.text.strip())
df = pd.DataFrame(D)
Another way, with a lambda function:
import pandas as pd

text_strip = lambda s: s.text.strip()
D = {
    'name': list(map(text_strip, faculty_name)),
    'mail': list(map(text_strip, email)),
    'phone': list(map(text_strip, phone))
}
df = pd.DataFrame(D)
If the lists don't all have the same length you may try this (but I am not sure it is very efficient):
import pandas as pd

columns_names = ['name', 'mail', 'phone']
all_lists = [faculty_name, email, phone]
max_length = max(map(len, all_lists))
D = {c_name: [None] * max_length for c_name in columns_names}
for c_name, l in zip(columns_names, all_lists):
    for ind, element in enumerate(l):
        D[c_name][ind] = element
df = pd.DataFrame(D)
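As an alternative sketch for the unequal-length case: wrapping each list in a pd.Series lets pandas do the padding itself.

import pandas as pd

columns_names = ['name', 'mail', 'phone']
all_lists = [faculty_name, email, phone]
# A DataFrame built from a dict of Series aligns the columns on the
# index and fills the missing tail of shorter columns with NaN
D = {c_name: pd.Series(l) for c_name, l in zip(columns_names, all_lists)}
df = pd.DataFrame(D)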
Try this:
data = {'name': [name.text.strip() for name in faculty_name],
        'mail': [mail.text.strip() for mail in email],
        'phn': [phn.text.strip() for phn in phone]}
df = pd.DataFrame.from_dict(data)
I am using this database: https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&localOrForeign%5B%5D=Foreign&start_year=1992&end_year=2019&group_by=year
I have preprocessed it into this csv (showing only 2 lines of 159):
year,combinedStatus,fullName,sortName,primaryNationality,secondaryNationality,tertiaryNationality,gender,photoUrl,photoCredit,type,lastStatus,typeOfDeath,status,employedAs,organizations,jobs,coverage,mediums,country,location,region,state,locality,province,localOrForeign,sourcesOfFire,motiveConfirmed,accountabilityCrossfire,accountabilityAssignment,impunityMurder,tortured,captive,threatened,charges,motive,lengthOfSentence,healthProblems,impCountry,entry,sentenceDate,sentence,locationImprisoned
1994,Confirmed,Abdelkader Hireche,,,,,Male,,,Journalist,,Murder,Killed,Staff,Algerian Television (ENTV),Broadcast Reporter,Politics,Television,Algeria,Algiers,,,Algiers,,Foreign,,Confirmed,,,Partial Impunity,No,No,No,,,,,,,,,
2014,Confirmed,Ahmed Hasan Ahmed,,,,,Male,,,Journalist,,Dangerous Assignment,Killed,Staff,Xinhua News Agency,"Camera Operator,Photographer","Human Rights,Politics,War",Internet,Syria,Damascus,,,Damascus,,Foreign,,Confirmed,,,,,,,,,,,,,,,
And I want to make this type of JSON out of it:
"Afghanistan": {"year": 2001, "fullName": "Volker Handloik", "gender": "Male", "typeOfDeath": "Crossfire", "employedAs": "Freelance", "organizations": "freelance reporter", "jobs": "Print Reporter", "coverage": "War", "mediums": "Print", "photoUrl": NaN}, "Somalia": {"year": 1994, "fullName": "Pierre Anceaux", "gender": "Male", "typeOfDeath": "Murder", "employedAs": "Freelance", "organizations": "freelance", "jobs": "Broadcast Reporter", "coverage": "Human Rights", "mediums": "Television", "photoUrl": NaN}
The problem is that Afghanistan (as you can see in the link) has had many journalist deaths. I want to list all these killings under the Index 'Afghanistan'. However, as I currently do it, only the last case (Volker Handloik) in the csv file shows up. How can I get it so every case shows up?
This is my code at the moment:
import pandas as pd
import pprint as pp
import json

# list with stand-ins for empty cells
missing_values = ["n/a", "na", "unknown", "-", ""]
# set missing values to NaN
df = pd.read_csv("data_journalists.csv", na_values=missing_values, skipinitialspace=True, error_bad_lines=False)

# columns to keep
columns_keep = ['year', 'fullName', 'gender', 'typeOfDeath', 'employedAs', 'organizations', 'jobs', 'coverage', 'mediums', 'country', 'photoUrl']
small_df = df[columns_keep]
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(small_df)

# create dict with country-column as index
df_dict = small_df.set_index('country').T.to_dict('dict')
print(df_dict)

# make json file from the dict
with open('result.json', 'w') as fp:
    json.dump(df_dict, fp)

# use pretty print to see if dict matches the json example in the exercise
pp.pprint(df_dict)
I want to include all of these names (and more) in the JSON under the index Afghanistan.
I think I need a list of objects attached to each country's index, so that every country can show all of its cases of journalist deaths instead of only one (with each case currently being overwritten by the next one in the csv). I hope this is clear enough.
I'll keep your code until the definition of small_df.
After that, we perform a groupby on the 'country' column and call to_json on each group:
country_series = small_df.groupby('country').apply(lambda r: r.drop(['country'], axis=1).to_json())
country_series is a pd.Series with the countries as index.
After that, we create a nested dictionary, so that we have a valid json object:
fullDict = {}
for ind, a in country_series.iteritems():
    b = json.loads(a)
    c = b['fullName']
    smallDict = {}
    for index, journalist in c.items():
        smallDict[journalist] = {}
        for i in b.keys():
            smallDict[journalist][i] = b[i][index]
    fullDict[ind] = smallDict
The nomenclature in my part of code is pretty bad, but I tried to write all the steps explicitly so that things should be clear.
Finally, we write the results to a file:
with open('result.json', 'w') as f:
    json.dump(fullDict, f)
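For reference, the same nesting can be built more compactly by iterating over the groups directly and using to_dict(orient='index'). This is a sketch that, like the loop above, assumes fullName is unique within each country:

fullDict = {
    country: group.set_index('fullName', drop=False)
                  .drop(columns='country')
                  .to_dict(orient='index')
    for country, group in small_df.groupby('country')
}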
I am extracting a column from an excel document with pandas. After that, for each row of the selected column, I want to replace all occurrences of the keys contained in multiple dictionaries grouped in a list.
import pandas as pd
file_loc = "excelFile.xlsx"
df = pd.read_excel(file_loc, usecols = "C")
In this case, the column is accessed as df['Q10']; this dataframe has more than 10k rows.
Traditionally, if I want to replace a value in df I use:
df['Q10'].str.replace('val1', 'val2')
Now, I have a dictionary of words like:
mydic = [
    {
        'key': "wasn't",
        'value': 'was not'
    },
    {
        'key': "I'm",
        'value': 'I am'
    }
    # ... + tons of lines of key-value pairs
]
Currently, I have created a function that iterates over "mydic" and replaces all occurrences one by one.
def replaceContractions(df, mydic):
    for cont in mydic:
        df.str.replace(cont['key'], cont['value'])
Next I call this function, passing my dataframe and mydic:
replaceContractions(df['Q10'], mydic)
First problem: this is very expensive, because mydic has a lot of items and the dataset is iterated over once for each of them.
Second: it doesn't seem to work :(
Any ideas?
Convert your "dictionary" to a more friendly format:
m = {d['key']: d['value'] for d in mydic}
m
{"I'm": 'I am', "wasn't": 'was not'}
Next, call replace with the regex switch and pass m to it.
df['Q10'] = df['Q10'].replace(m, regex=True)
replace accepts a dictionary of key-replacement pairs, and it should be much faster than iterating over the keys and replacing them one at a time.
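A minimal usage sketch with made-up sample data:

import pandas as pd

df = pd.DataFrame({'Q10': ["I'm sure it wasn't me"]})
m = {"I'm": 'I am', "wasn't": 'was not'}
# Each key is treated as a regex and replaced inside the strings
df['Q10'] = df['Q10'].replace(m, regex=True)
print(df['Q10'].iloc[0])  # I am sure it was not me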