Handling bad data that combines a quantity (int) and an object (str) - python

The dataframe "df_bool" in the below code produces "False" where it should produce "True". I believe this is because because some of my data contains an integer that represents quantities greater than one. I am not sure what the best solution is here? Perhaps separating the value from the object? How would I do that?
My intent is to produce a table that tells me when an object is present in df['Favorite fruit'] for each row.
import pandas as pd

def boolean_df(item_lists, unique_items):
    # Create empty dict
    bool_dict = {}
    # Loop through all the tags
    for i, item in enumerate(unique_items):
        # Apply boolean mask
        bool_dict[item] = item_lists.apply(lambda x: item in x)
    # Return the results as a dataframe
    return pd.DataFrame(bool_dict)

item_list = {'Name': ["kim", "colby", "ryan", "carter"],
             'Favorite fruit': ["apple", ("orange", "grape"), ("3apple", "3grape"), ("2apple", "2orange")]}
df = pd.DataFrame(data=item_list)
unique_items = {"apple": "1", "orange": "2", "grape": "3"}
df_bool = boolean_df(df['Favorite fruit'], unique_items.keys())
I have tried separating the values from the objects with no success.
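One possible fix, as a minimal sketch: assuming the quantity is always a leading integer prefix glued onto the fruit name (e.g. "3apple"), strip the digits from each entry before the membership test. The strip_quantity helper below is hypothetical, not part of the original code:

import re
import pandas as pd

def strip_quantity(entry):
    # Hypothetical helper: drop a leading integer prefix such as the "3" in "3apple"
    return re.sub(r'^\d+', '', entry)

def boolean_df(item_lists, unique_items):
    bool_dict = {}
    for item in unique_items:
        # Treat a bare string cell as a one-element tuple, then strip quantities before testing
        bool_dict[item] = item_lists.apply(
            lambda x: item in [strip_quantity(e) for e in ((x,) if isinstance(x, str) else x)]
        )
    return pd.DataFrame(bool_dict)

df_bool = boolean_df(df['Favorite fruit'], unique_items.keys())

With the example data above, df_bool should then report True for "apple" on the rows containing "3apple" and "2apple".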

Related

Convert CSV into JSON. How do I keep values with the same Index?

I am using this database: https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&localOrForeign%5B%5D=Foreign&start_year=1992&end_year=2019&group_by=year
I have preprocessed it into this csv (showing only 2 lines of 159):
year,combinedStatus,fullName,sortName,primaryNationality,secondaryNationality,tertiaryNationality,gender,photoUrl,photoCredit,type,lastStatus,typeOfDeath,status,employedAs,organizations,jobs,coverage,mediums,country,location,region,state,locality,province,localOrForeign,sourcesOfFire,motiveConfirmed,accountabilityCrossfire,accountabilityAssignment,impunityMurder,tortured,captive,threatened,charges,motive,lengthOfSentence,healthProblems,impCountry,entry,sentenceDate,sentence,locationImprisoned
1994,Confirmed,Abdelkader Hireche,,,,,Male,,,Journalist,,Murder,Killed,Staff,Algerian Television (ENTV),Broadcast Reporter,Politics,Television,Algeria,Algiers,,,Algiers,,Foreign,,Confirmed,,,Partial Impunity,No,No,No,,,,,,,,,
2014,Confirmed,Ahmed Hasan Ahmed,,,,,Male,,,Journalist,,Dangerous Assignment,Killed,Staff,Xinhua News Agency,"Camera Operator,Photographer","Human Rights,Politics,War",Internet,Syria,Damascus,,,Damascus,,Foreign,,Confirmed,,,,,,,,,,,,,,,
And I want to make this type of JSON out of it:
"Afghanistan": {"year": 2001, "fullName": "Volker Handloik", "gender": "Male", "typeOfDeath": "Crossfire", "employedAs": "Freelance", "organizations": "freelance reporter", "jobs": "Print Reporter", "coverage": "War", "mediums": "Print", "photoUrl": NaN}, "Somalia": {"year": 1994, "fullName": "Pierre Anceaux", "gender": "Male", "typeOfDeath": "Murder", "employedAs": "Freelance", "organizations": "freelance", "jobs": "Broadcast Reporter", "coverage": "Human Rights", "mediums": "Television", "photoUrl": NaN}
The problem is that Afghanistan (as you can see in the link) has had many journalist deaths. I want to list all these killings under the Index 'Afghanistan'. However, as I currently do it, only the last case (Volker Handloik) in the csv file shows up. How can I get it so every case shows up?
This is my code at the moment:
import pandas as pd
import pprint as pp
import json

# list with stand-ins for empty cells
missing_values = ["n/a", "na", "unknown", "-", ""]
# set missing values to NaN
df = pd.read_csv("data_journalists.csv", na_values=missing_values, skipinitialspace=True, error_bad_lines=False)
# columns to keep
columns_keep = ['year', 'fullName', 'gender', 'typeOfDeath', 'employedAs', 'organizations', 'jobs', 'coverage', 'mediums', 'country', 'photoUrl']
small_df = df[columns_keep]
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(small_df)
# create dict with country-column as index
df_dict = small_df.set_index('country').T.to_dict('dict')
print(df_dict)
# make json file from the dict
with open('result.json', 'w') as fp:
    json.dump(df_dict, fp)
# use pretty print to see if dict matches the json example in the exercise
pp.pprint(df_dict)
I want to include all of these names (and more) in the JSON under the index 'Afghanistan'.
I think I will need a list of objects attached to each country's index, so that every country can show all the cases of journalist deaths instead of only one (each being replaced by the next in the csv). I hope this is clear enough.
I'll keep your code up to the definition of small_df.
After that, we group by the 'country' column and call to_json on each group:
country_series = small_df.groupby('country').apply(lambda r : r.drop(['country'], axis=1).to_json())
country_series is a pd.Series with the countries as index.
After that, we create a nested dictionary, so that we have a valid json object:
fullDict = {}
for ind, a in country_series.iteritems():
    b = json.loads(a)
    c = b['fullName']
    smallDict = {}
    for index, journalist in c.items():
        smallDict[journalist] = {}
        for i in b.keys():
            smallDict[journalist][i] = b[i][index]
    fullDict[ind] = smallDict
The naming in my part of the code is pretty bad, but I tried to write all the steps explicitly so that everything should be clear.
Finally, we write the results to a file:
with open('result.json', 'w') as f:
    json.dump(fullDict, f)

Use JSON column into a Pattern using Pandas

I have a JSON file of data. A sample of it is given below.
[{
    "Type": "Fruit",
    "Names": "Apple;Orange;Papaya"
}, {
    "Type": "Veggie",
    "Names": "Cucumber;Spinach;Tomato"
}]
I have to read the Names and match each item of the Names against a column in another df.
I am stuck at converting the value of the Names key into a list that can be used in a pattern. The code I tried is
df1 = pd.DataFrame(data)
PriList=df1['Names'].str.split(";", n = 1, expand = True)
Pripat = '|'.join(r"\b{}\b".format(x) for x in PriList)
df['Match'] = df['MasterList'].str.findall('('+ Pripat + ')').str.join(', ')
The issue is with the Pripat. Its content is
\bApple, Orange\b
If I give the Names in a list like below
Prilist=['Apple','Orange','Papaya']
the code works fine...
Please help.
You'll need to call str.split and then flatten the result using itertools.chain.
First, do
df2 = df1.loc[df1.Type.eq('Fruit')]
Now,
from itertools import chain
prilist = list(chain.from_iterable(df2.Names.str.split(';').values))
There's also stack (which is slower):
prilist = df2.Names.str.split(';', expand=True).stack().tolist()
print(prilist)
['Apple', 'Orange', 'Papaya']
df2 = df1.loc[df1.Type.eq('Fruit')]
out_list = ';'.join(df2['Names'].values).split(';')
# print(out_list)
['Apple', 'Orange', 'Papaya']

Import nested MongoDB to Pandas

I have a Collection with heavily nested docs in MongoDB, I want to flatten and import to Pandas. There are some nested dicts, but also a list of dicts that I want to transform into columns (see examples below for details).
I already have a function that works for smaller batches of documents. But the solution (I found it in the answer to this question) uses json. The problem with the json.loads operation is that it fails with a MemoryError on bigger selections from the collection.
I tried many solutions suggesting other json parsers (e.g. ijson), but for different reasons none of them solved my problem. The only way left, if I want to keep the transformation via json, would be to chunk bigger selections into smaller groups of documents and iterate the parsing.
At this point I thought - and that is my main question here - maybe there is a smarter way to do the unnesting without taking the detour through json, directly in MongoDB or in Pandas or somehow combined?
This is a shortened example Doc:
{
    '_id': ObjectId('5b40fcc4affb061b8871cbc5'),
    'eventId': 2,
    'sId': 6833,
    'stage': {
        'value': 1,
        'Name': 'FirstStage'
    },
    'quality': [
        {
            'type': {
                'value': 2,
                'Name': 'Color'
            },
            'value': '124'
        },
        {
            'type': {
                'value': 7,
                'Name': 'Length'
            },
            'value': 'Short'
        },
        {
            'type': {
                'value': 15,
                'Name': 'Printed'
            }
        }
    ]
}
This is what a successful dataframe representation would look like (I skipped the columns '_id' and 'sId' for readability):
   eventId  stage.value  stage.name    q_color  q_length  q_printed
1        2            1  'FirstStage'      124   'Short'          1
My code so far (which runs into memory problems - see above):
def load_events(filter = 'sId', id = 6833, all = False):
    if all:
        print('Loading all events.')
        cursor = events.find()
    else:
        print('Loading events with %s equal to %s.' % (filter, id))
        print('Filtering...')
        cursor = events.find({filter : id})
    print('Loading...')
    l = list(cursor)
    print('Parsing json...')
    sanitized = json.loads(json_util.dumps(l))
    print('Parsing quality...')
    for ev in sanitized:
        for q in ev['quality']:
            name = 'q_' + str(q['type']['Name'])
            value = q.pop('value', 1)
            ev[name] = value
        ev.pop('quality', None)
    normalized = json_normalize(sanitized)
    df = pd.DataFrame(normalized)
    return df
You don't need to convert the nested structures using json parsers. Just create your dataframe from the record list:
df = DataFrame(list(cursor))
and afterwards use pandas in order to unpack your lists and dictionaries:
import pandas
from itertools import chain
import numpy
df = pandas.DataFrame(list(cursor))
df['stage.value'] = df['stage'].apply(lambda cell: cell['value'])
df['stage.name'] = df['stage'].apply(lambda cell: cell['Name'])
df['q_'] = df['quality'].apply(lambda cell: [(m['type']['Name'], m['value'] if 'value' in m.keys() else 1) for m in cell])
df['q_'] = df['q_'].apply(lambda cell: dict((k, v) for k, v in cell))
keys = set(chain(*df['q_'].apply(lambda column: column.keys())))
for key in keys:
    column_name = 'q_{}'.format(key).lower()
    df[column_name] = df['q_'].apply(lambda cell: cell[key] if key in cell.keys() else numpy.NaN)
df.drop(['stage', 'quality', 'q_'], axis=1, inplace=True)
I use three steps to unpack the nested data types. Firstly, the names and values are used to create a flat list of pairs (tuples). In the second step, a dictionary built from those tuples takes its keys from the first and its values from the second position of each tuple. Then all existing property names are extracted once using a set. Each property gets a new column in a loop; inside the loop, the value of each pair is mapped to the respective column cell.
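To make those three steps concrete, here is a minimal sketch applied to a single 'quality' cell from the example document above:

# One 'quality' cell taken from the example document above
quality_cell = [
    {'type': {'value': 2, 'Name': 'Color'}, 'value': '124'},
    {'type': {'value': 7, 'Name': 'Length'}, 'value': 'Short'},
    {'type': {'value': 15, 'Name': 'Printed'}},  # no 'value' key, so it defaults to 1
]

# Step 1: flat list of (name, value) pairs
pairs = [(m['type']['Name'], m['value'] if 'value' in m else 1) for m in quality_cell]
# Step 2: dictionary built from the pairs
as_dict = dict(pairs)
print(as_dict)  # {'Color': '124', 'Length': 'Short', 'Printed': 1}
# Step 3 (in the full code) collects all keys across rows and turns each one into a q_* column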

Replace values from pandas dataset with dictionary

I am extracting a column from an excel document with pandas. After that, I want to replace, for each row of the selected column, all keys contained in multiple dictionaries grouped in a list.
import pandas as pd
file_loc = "excelFile.xlsx"
df = pd.read_excel(file_loc, usecols = "C")
In this case, my column is accessed as df['Q10']; this data frame has more than 10k rows.
Traditionally, if I want to replace a value in df I use;
df['Q10'].str.replace('val1', 'val2')
Now, I have a dictionary of words like:
mydic = [
    {
        'key': "wasn't",
        'value': 'was not'
    },
    {
        'key': "I'm",
        'value': 'I am'
    },
    # ... plus tons of key-value pairs
]
Currently, I have created a function that iterates over "mydic" and replaces all occurrences one by one.
def replaceContractions(df, mydic):
    for cont in contractions:
        df.str.replace(cont['key'], cont['value'])
Next, I call this function, passing mydic and my dataframe:
replaceContractions(df['Q10'], contractions)
First problem: this is very expensive because mydic has a lot of items and the data set is iterated over for each item in it.
Second: it seems that it doesn't work :(
Any ideas?
Convert your "dictionary" to a more friendly format:
m = {d['key'] : d['value'] for d in mydic}
m
{"I'm": 'I am', "wasn't": 'was not'}
Next, call replace with the regex switch and pass m to it.
df['Q10'] = df['Q10'].replace(m, regex=True)
replace accepts a dictionary of key-replacement pairs, and it should be much faster than iterating over each key-replacement at a time.
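For illustration, a minimal sketch with made-up sample data (the two-row 'Q10' column below is hypothetical):

import pandas as pd

# Hypothetical sample column, only to demonstrate the dict-based replace
df = pd.DataFrame({'Q10': ["I'm tired", "it wasn't me"]})
m = {"I'm": 'I am', "wasn't": 'was not'}

df['Q10'] = df['Q10'].replace(m, regex=True)
print(df['Q10'].tolist())  # ['I am tired', 'it was not me']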

How to append to a dictionary of dictionaries

I'm trying to create a dictionary of dictionaries like this:
food = {"Broccoli": {"Taste": "Bad", "Smell": "Bad"},
"Strawberry": {"Taste": "Good", "Smell": "Good"}}
But I am populating it from an SQL table. So I've pulled the SQL table into an SQL object called "result". And then I got the column names like this:
nutCol = [i[0] for i in result.description]
The table has about 40 characteristics, so it is quite long.
I can do this...
foodList = {}
for id, food in enumerate(result):
    addMe = {str(food[1]): {nutCol[id + 2]: food[2], nutCol[id + 3]: food[3], ...}}
    foodList.update(addMe)
But this, of course, would look horrible and take a while to write. And I'm still working out how I want to build this whole thing, so it's possible I'll need to change it a few times... which could get extremely tedious.
Is there a DRY way of doing this?
In order to make the solution position-independent, you can make use of dict1.update(dict2). This simply merges dict2 into dict1.
In our case, since we have a dict of dicts, we can use dict['key'] as dict1 and simply add any additional key-value pair as dict2.
Here is an example.
food = {"Broccoli": {"Taste": "Bad", "Smell": "Bad"},
"Strawberry": {"Taste": "Good", "Smell": "Good"}}
addthis = {'foo':'bar'}
Suppose you want to add the addthis dict to food["Strawberry"]; we can simply use:
food["Strawberry"].update(addthis)
Getting result:
>>> food
{'Strawberry': {'Taste': 'Good', 'foo': 'bar', 'Smell': 'Good'},'Broccoli': {'Taste': 'Bad', 'Smell': 'Bad'}}
>>>
Assuming that column 0 is what you wish to use as your key, and you do wish to build a dictionary of dictionaries, then it's:
detail_names = [col[0] for col in result.description[1:]]
foodList = {row[0]: dict(zip(detail_names, row[1:]))
            for row in result}
Generalising, if column k is your identity, then it's:
foodList = {row[k]: {col[0]: row[i]
                     for i, col in enumerate(result.description) if i != k}
            for row in result}
(Here each sub dictionary is all columns other than column k)
addMe = {str(food[1]): dict(zip(nutCol[2:], food[2:]))}
zip will take two (or more) lists of items and pair up the elements; you can then pass the result to dict to turn the pairs into a dictionary.
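As a quick illustration, using made-up column names and a single row modelled on the Broccoli example above:

# Made-up column names and one result row, mirroring the food example above
nutCol = ['id', 'name', 'Taste', 'Smell']
food = (7, 'Broccoli', 'Bad', 'Bad')

addMe = {str(food[1]): dict(zip(nutCol[2:], food[2:]))}
print(addMe)  # {'Broccoli': {'Taste': 'Bad', 'Smell': 'Bad'}}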
