The dataframe "df_bool" in the below code produces "False" where it should produce "True". I believe this is because because some of my data contains an integer that represents quantities greater than one. I am not sure what the best solution is here? Perhaps separating the value from the object? How would I do that?
My intent is to produce a table that tells me when an object is present in df['Favorite fruit'] for each row.
import pandas as pd

def boolean_df(item_lists, unique_items):
    # Create empty dict
    bool_dict = {}
    # Loop through all the tags
    for i, item in enumerate(unique_items):
        # Apply boolean mask
        bool_dict[item] = item_lists.apply(lambda x: item in x)
    # Return the results as a dataframe
    return pd.DataFrame(bool_dict)

item_list = {'Name': ["kim", "colby", "ryan", "carter"],
             'Favorite fruit': ["apple", ("orange", "grape"), ("3apple", "3grape"), ("2apple", "2orange")]}
df = pd.DataFrame(data=item_list)
unique_items = {"apple": "1", "orange": "2", "grape": "3"}
df_bool = boolean_df(df['Favorite fruit'], unique_items.keys())
I have tried separating the values from the objects with no success.
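One way to separate the quantity from the object (a sketch, assuming the quantity is always a leading run of digits, as in "3apple") would be to strip those digits with a regex before the membership test:

import pandas as pd
import re

def strip_quantity(cell):
    # Wrap a bare string in a tuple so single values and tuples are
    # handled alike, then drop any leading digits ("3apple" -> "apple")
    items = (cell,) if isinstance(cell, str) else cell
    return tuple(re.sub(r'^\d+', '', i) for i in items)

def boolean_df(item_lists, unique_items):
    # Same masking idea as above, but each cell is normalized first
    return pd.DataFrame({
        item: item_lists.apply(lambda x: item in strip_quantity(x))
        for item in unique_items
    })

With the sample data above, boolean_df(df['Favorite fruit'], unique_items.keys()) would then mark "apple" as True for ryan ("3apple") and carter ("2apple") as well.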
I'd like to use the to_json() function to serialize a pandas dataframe while encapsulating each row in a root 'Person' element.
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df.to_json(orient='records')
'[{"Name":"tom","Age":10},{"Name":"nick","Age":15},{"Name":"juli","Age":14}]'
I'd like the to_json() output to be:
'[{"Person":{"Name":"tom","Age":10}},{"Person":{"Name":"nick","Age":15}},{"Person":{"Name":"juli","Age":14}}]'
I'm thinking this can be achieved with dataframe.apply() but haven't been able to figure it out.
Thx.
Use a list comprehension to create a list of dicts from df.to_dict:
In [4370]: d = [{'Person':i} for i in df.to_dict(orient='records')]
Convert above dict to json using json.dumps:
In [4372]: import json
In [4373]: j = json.dumps(d)
In [4374]: print(j)
[{"Person": {"Name": "tom", "Age": 10}}, {"Person": {"Name": "nick", "Age": 15}}, {"Person": {"Name": "juli", "Age": 14}}]
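For reference, the two steps can also be collapsed into a single expression:

j = json.dumps([{'Person': rec} for rec in df.to_dict(orient='records')])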
I suppose you want to use Person as an index or identifier for each record; otherwise, including Person as a fixed string key in every nested dict would be redundant. If that is the case, you can pass 'index' as the orient argument, which keys each record by the dataframe's index.
>>> import pandas as pd
>>> data = [['tom', 10], ['nick', 15], ['juli', 14]]
>>> df = pd.DataFrame(data, columns=['Name', 'Age'])
>>> df.to_json(orient='index')
'{"0":{"Name":"tom","Age":10},"1":{"Name":"nick","Age":15},"2":{"Name":"juli","Age":14}}'
Also, you can adjust your index to whatever you want. I hope this is what you want.
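For example (a sketch of the same idea), setting Name as the index before serializing keys each record by name rather than by row number:

>>> df.set_index('Name').to_json(orient='index')
'{"tom":{"Age":10},"nick":{"Age":15},"juli":{"Age":14}}'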
How can I add the outputs of different for loops into one dataframe? For example, I have scraped data from a website and have lists of names, emails and phone numbers, each produced by its own loop. I want to add all of the outputs into a single dataframe as one table.
I am able to do it for one single loop, but not for multiple loops.
If I remove zip from the for loop, it gives the error "too many values to unpack".
Loop
phone = soup.find_all(class_="directory_item_phone directory_item_info_item")
for phn in phone:
    print(phn.text.strip())
##Output - List of Numbers
Code for df
df = list()
for name, mail, phn in zip(faculty_name, email, phone):
    df.append(name.text.strip())
    df.append(mail.text.strip())
    df.append(phn.text.strip())
df = pd.DataFrame(df)
df
An efficient way to create a pandas.DataFrame is to first create a dict and then convert it into a DataFrame.
In your case you could probably do:
import pandas as pd

D = {'name': [], 'mail': [], 'phone': []}
for name, mail, phn in zip(faculty_name, email, phone):
    D['name'].append(name.text.strip())
    D['mail'].append(mail.text.strip())
    D['phone'].append(phn.text.strip())
df = pd.DataFrame(D)
Another way, with a lambda function:
import pandas as pd

text_strip = lambda s: s.text.strip()
D = {
    'name': list(map(text_strip, faculty_name)),
    'mail': list(map(text_strip, email)),
    'phone': list(map(text_strip, phone))
}
df = pd.DataFrame(D)
If the lists don't all have the same length you may try this (but I am not sure it is very efficient):
import pandas as pd

columns_names = ['name', 'mail', 'phone']
all_lists = [faculty_name, email, phone]
max_length = max(map(len, all_lists))
D = {c_name: [None] * max_length for c_name in columns_names}
for c_name, l in zip(columns_names, all_lists):
    for ind, element in enumerate(l):
        D[c_name][ind] = element
df = pd.DataFrame(D)
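As an alternative sketch for the unequal-length case: wrapping each list in a pd.Series lets pandas do the padding itself.

import pandas as pd

columns_names = ['name', 'mail', 'phone']
all_lists = [faculty_name, email, phone]
# A DataFrame built from a dict of Series aligns the columns on the
# index and fills the missing tail of shorter columns with NaN
D = {c_name: pd.Series(l) for c_name, l in zip(columns_names, all_lists)}
df = pd.DataFrame(D)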
Try this:
data = {'name': [name.text.strip() for name in faculty_name],
        'mail': [mail.text.strip() for mail in email],
        'phn': [phn.text.strip() for phn in phone]}
df = pd.DataFrame.from_dict(data)
I am using this database: https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&localOrForeign%5B%5D=Foreign&start_year=1992&end_year=2019&group_by=year
I have preprocessed it into this csv (showing only 2 lines of 159):
year,combinedStatus,fullName,sortName,primaryNationality,secondaryNationality,tertiaryNationality,gender,photoUrl,photoCredit,type,lastStatus,typeOfDeath,status,employedAs,organizations,jobs,coverage,mediums,country,location,region,state,locality,province,localOrForeign,sourcesOfFire,motiveConfirmed,accountabilityCrossfire,accountabilityAssignment,impunityMurder,tortured,captive,threatened,charges,motive,lengthOfSentence,healthProblems,impCountry,entry,sentenceDate,sentence,locationImprisoned
1994,Confirmed,Abdelkader Hireche,,,,,Male,,,Journalist,,Murder,Killed,Staff,Algerian Television (ENTV),Broadcast Reporter,Politics,Television,Algeria,Algiers,,,Algiers,,Foreign,,Confirmed,,,Partial Impunity,No,No,No,,,,,,,,,
2014,Confirmed,Ahmed Hasan Ahmed,,,,,Male,,,Journalist,,Dangerous Assignment,Killed,Staff,Xinhua News Agency,"Camera Operator,Photographer","Human Rights,Politics,War",Internet,Syria,Damascus,,,Damascus,,Foreign,,Confirmed,,,,,,,,,,,,,,,
And I want to make this type of JSON out of it:
"Afghanistan": {"year": 2001, "fullName": "Volker Handloik", "gender": "Male", "typeOfDeath": "Crossfire", "employedAs": "Freelance", "organizations": "freelance reporter", "jobs": "Print Reporter", "coverage": "War", "mediums": "Print", "photoUrl": NaN}, "Somalia": {"year": 1994, "fullName": "Pierre Anceaux", "gender": "Male", "typeOfDeath": "Murder", "employedAs": "Freelance", "organizations": "freelance", "jobs": "Broadcast Reporter", "coverage": "Human Rights", "mediums": "Television", "photoUrl": NaN}
The problem is that Afghanistan (as you can see in the link) has had many journalist deaths. I want to list all these killings under the Index 'Afghanistan'. However, as I currently do it, only the last case (Volker Handloik) in the csv file shows up. How can I get it so every case shows up?
This is my code at the moment:
import pandas as pd
import pprint as pp
import json

# list with stand-ins for empty cells
missing_values = ["n/a", "na", "unknown", "-", ""]
# set missing values to NaN
df = pd.read_csv("data_journalists.csv", na_values=missing_values, skipinitialspace=True, error_bad_lines=False)

# columns to keep
columns_keep = ['year', 'fullName', 'gender', 'typeOfDeath', 'employedAs', 'organizations', 'jobs', 'coverage', 'mediums', 'country', 'photoUrl']
small_df = df[columns_keep]
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(small_df)

# create dict with country-column as index
df_dict = small_df.set_index('country').T.to_dict('dict')
print(df_dict)

# make json file from the dict
with open('result.json', 'w') as fp:
    json.dump(df_dict, fp)

# use pretty print to see if dict matches the json example in the exercise
pp.pprint(df_dict)
I want to include all of these names (and more) in the JSON under the index Afghanistan.
I think I need a list of objects attached to each country's index, so that every country can show all of its cases of journalist deaths instead of only one (with each case currently being overwritten by the next one in the csv). I hope this is clear enough.
I'll keep your code until the definition of small_df.
After that, we perform a groupby on the 'country' column and call to_json on each group:
country_series = small_df.groupby('country').apply(lambda r: r.drop(['country'], axis=1).to_json())
country_series is a pd.Series with the countries as index.
After that, we create a nested dictionary, so that we have a valid json object:
fullDict = {}
for ind, a in country_series.iteritems():
    b = json.loads(a)
    c = b['fullName']
    smallDict = {}
    for index, journalist in c.items():
        smallDict[journalist] = {}
        for i in b.keys():
            smallDict[journalist][i] = b[i][index]
    fullDict[ind] = smallDict
The nomenclature in my part of code is pretty bad, but I tried to write all the steps explicitly so that things should be clear.
Finally, we write the results to a file:
with open('result.json', 'w') as f:
    json.dump(fullDict, f)
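For reference, the same nesting can be built more compactly by iterating over the groups directly and using to_dict(orient='index'). This is a sketch that, like the loop above, assumes fullName is unique within each country:

fullDict = {
    country: group.set_index('fullName', drop=False)
                  .drop(columns='country')
                  .to_dict(orient='index')
    for country, group in small_df.groupby('country')
}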
I am extracting a column from an excel document with pandas. After that, for each row of the selected column, I want to replace all occurrences of the keys contained in multiple dictionaries grouped in a list.
import pandas as pd
file_loc = "excelFile.xlsx"
df = pd.read_excel(file_loc, usecols = "C")
In this case, the column is accessed as df['Q10']; this dataframe has more than 10k rows.
Traditionally, if I want to replace a value in df I use:
df['Q10'].str.replace('val1', 'val2')
Now, I have a dictionary of words like:
mydic = [
    {
        'key': "wasn't",
        'value': 'was not'
    },
    {
        'key': "I'm",
        'value': 'I am'
    }
    # ... + tons of lines of key-value pairs
]
Currently, I have created a function that iterates over "mydic" and replaces all occurrences one by one.
def replaceContractions(df, mydic):
    for cont in mydic:
        df.str.replace(cont['key'], cont['value'])
Next I call this function, passing my dataframe and mydic:
replaceContractions(df['Q10'], mydic)
First problem: this is very expensive, because mydic has a lot of items and the dataset is iterated over once for each of them.
Second: it doesn't seem to work :(
Any ideas?
Convert your "dictionary" to a more friendly format:
m = {d['key']: d['value'] for d in mydic}
m
{"I'm": 'I am', "wasn't": 'was not'}
Next, call replace with the regex switch and pass m to it.
df['Q10'] = df['Q10'].replace(m, regex=True)
replace accepts a dictionary of key-replacement pairs, and it should be much faster than iterating over the keys and replacing them one at a time.
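A minimal usage sketch with made-up sample data:

import pandas as pd

df = pd.DataFrame({'Q10': ["I'm sure it wasn't me"]})
m = {"I'm": 'I am', "wasn't": 'was not'}
# Each key is treated as a regex and replaced inside the strings
df['Q10'] = df['Q10'].replace(m, regex=True)
print(df['Q10'].iloc[0])  # I am sure it was not me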