This is probably an easy one for the Python pros, so please forgive my naivety.
Here is my data:
0 xyz#tim.com 13239169023 jane bo
1 tim#tim.com 13239169023 lane ko
2 jim#jim.com 13239169023 john do
Here is what I get as output:
[{"email":"xyz#tim.com","phone_number":13239169023,"first_name":"jane","last_name":"bo"},{"email":"tim#tim.com","phone_number":13239169023,"first_name":"lane","last_name":"ko"},{"email":"jim#jim.com","phone_number":13239169023,"first_name":"john","last_name":"do"}]
My Code:
import pandas as pd
df = pd.read_csv('profiles.csv')
print(df)
data = df.to_json(orient="records")
print(data)
Output I want:
{"profiles":[{"email":"xyz#tim.com","phone_number":13239169023,"first_name":"jane","last_name":"bo"},{"email":"tim#tim.com","phone_number":13239169023,"first_name":"lane","last_name":"ko"},{"email":"jim#jim.com","phone_number":13239169023,"first_name":"john","last_name":"do"}]}
Adding the following does NOT work:
output = {"profiles": data}
It adds single quotes around the data, and profiles is NOT in double quotes (basically NOT valid JSON), like so:
{'profiles': '[{"email":"xyz#tim.com","phone_number":13239169023,"first_name":"jane","last_name":"bo"},{"email":"tim#tim.com","phone_number":13239169023,"first_name":"lane","last_name":"ko"},{"email":"jim#jim.com","phone_number":13239169023,"first_name":"john","last_name":"do"}]'}
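The single quotes appear because df.to_json already returns a JSON-formatted string, so putting that string inside a Python dict and printing it shows the dict's repr rather than JSON. One way around it, sketched under the assumption that the same df is used, is to parse the string back into Python objects before wrapping:
import json
data = json.loads(df.to_json(orient="records"))  # a list of dicts, not a string
output = json.dumps({"profiles": data})          # valid JSON text
print(output)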
You can use df.to_dict to output a dictionary instead of a JSON-formatted string:
import pandas as pd
df = pd.read_csv('data.csv')
data = df.to_dict(orient='records')
output = {'profiles': data}
print(output)
Returns:
{'profiles': [{'0': 1, 'xyz#tim.com': 'tim#tim.com', '13239169023': 13239169023, 'jane': 'lane', 'bo': 'ko'}, {'0': 2, 'xyz#tim.com': 'jim#jim.com', '13239169023': 13239169023, 'jane': 'john', 'bo': 'do'}]}
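Note: the odd keys in that output ('0', 'xyz#tim.com', ...) come from read_csv treating the first data row as the header. If the file really has no header row, a hedged sketch (assuming it contains exactly the four fields named in the desired output):
df = pd.read_csv('data.csv', header=None,
                 names=['email', 'phone_number', 'first_name', 'last_name'])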
I think I found a solution.
Changes:
data = df.to_dict(orient="records")
output = {}
output["profiles"] = data
I have a dataframe that contains strings of email addresses in the format:
d = {'Country':'A', 'Email':'123#abc.com,456#def.com,789#ghi.com'}
df = pd.DataFrame(data=d)
and I want the username of emails only. So the new dataframe should look like this:
d = {'Country':'A', 'Email':'123,456,789'}
df1 = pd.DataFrame(data=d)
The best way I could think of is to split the original string by comma, delete the domain part of the emails and join the list back again. Are there better ways to solve this problem?
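For reference, the split/strip/join approach described above might look roughly like this as plain Python (a sketch, assuming '#' stands in for '@' as in the examples):
def usernames(emails):
    # '123#abc.com,456#def.com' -> '123,456'
    return ','.join(addr.split('#')[0] for addr in emails.split(','))
df['Email'] = df['Email'].apply(usernames)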
If you want a string as output, you can remove the part starting at #. Use str.replace with the #[^,]+ regex:
df['Email'] = df['Email'].str.replace(r'#[^,]+', '', regex=True)
Output:
Country Email
0 A 123,456,789
For a list you could use str.findall:
df['Email'] = df['Email'].str.findall(r'[^,]+(?=#)')
Output:
Country Email
0 A [123, 456, 789]
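If the comma-separated string form from the question is preferred over a list, the findall result can be joined back with str.join, for example:
df['Email'] = df['Email'].str.findall(r'[^,]+(?=#)').str.join(',')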
This is a regex question, not really a Pandas question but here's a solution that'll return a list (which you can join together as a string)
import re
df['Email'].apply(lambda s: re.findall(r'\w+(?=#)', s))
Output:
0 [123, 456, 789]
Name: Email, dtype: object
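And, as mentioned, joining that list back into a comma-separated string might look like:
df['Email'] = df['Email'].apply(lambda s: ','.join(re.findall(r'\w+(?=#)', s)))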
Try this
import pandas as pd
d = {'Country':['A', 'B'], 'Email':['123#abc.com,456#def.com,789#ghi.com', '134#abc.com,436#def.com,229#ghi.com']}
df = pd.DataFrame(d)
df['Email'] = df['Email'].str.findall(r'#\w+\.com').apply(', '.join).str.replace('#', '')
df
Output
Country Email
A abc.com, def.com, ghi.com
B abc.com, def.com, ghi.com
Here is one way to do it:
import re
# iterate through the dictionary values and replace anything from # up to the next comma or quote
d = {k: re.sub(r"#[^,\']+", '', v) for k, v in d.items()}
d
{'Country': 'A', 'Email': '123,456,789'}
I have a request that gets me some data that looks like this:
[{'__rowType': 'META',
'__type': 'units',
'data': [{'name': 'units.unit', 'type': 'STRING'},
{'name': 'units.classification', 'type': 'STRING'}]},
{'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']}]
and would like to construct a (Pandas) DataFrame that looks like this (columns unit and classification, index starting at 1):
   unit classification
1     A        Energie
2   bar
3   CCM        Volumen
4   CDM        Volumen
Things like pd.DataFrame(pd.json_normalize(test)['data']) are close but still throw the whole list into one column instead of making separate columns. record_path sounded right but I can't get it to work correctly either.
Any help?
It's difficult to know how the example generalizes, but for this particular case you could use:
pd.DataFrame([d['data'] for d in test
if d.get('__rowType', None)=='DATA' and 'data' in d],
columns=['unit', 'classification']
)
NB: assuming test is the input list.
output:
unit classification
0 A Energie
1 bar
2 CCM Volumen
3 CDM Volumen
Instead of just giving you the code, I'll first explain how you can do this in detail and then show you the exact steps to follow and the final code. This way you'll understand everything for any similar situation.
When you want to create a pandas dataframe with two columns, you can do this by creating a dictionary and passing it to the DataFrame class:
my_data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=my_data)
This will result in this dataframe:
   col1  col2
0     1     3
1     2     4
So if you want to have the dataframe you specified in your question, the my_data dictionary should look like this:
my_data = {
'unit': ['A', 'bar', 'CCM', 'CDM'],
'classification': ['Energie', '', 'Volumen', 'Volumen'],
}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df)+1)
df
(You can see the df.index = ... part. This is because the index of the desired dataframe starts at 1 in your question.)
So you just have to extract these values from the data you provided and convert them into the exact dictionary mentioned above (the my_data dictionary).
To do so you can do this:
# This will get the data values like 'bar', 'CCM', etc. from your initial data
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the column names from the metadata
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to the DataFrame class.
my_data = {column: [v[i] for v in values] for i, column in enumerate(columns)}
So the whole code would be this:
import pandas as pd
import numpy as np
d = YOUR_DATA
# This will get the data values like 'bar', 'CCM', etc.
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the column names from the metadata
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to the DataFrame class.
my_data = {column: [v[i] for v in values] for i, column in enumerate(columns)}
df = pd.DataFrame(data=my_data)
df.index = np.arange(1, len(df)+1)
df  # or print(df)
Note: Of course you can do all of this in one complex line of code, but to avoid confusion I decided to do it in a couple of lines of code.
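For completeness, a more compact sketch of the same idea, passing the extracted rows and column names straight to the DataFrame constructor instead of building the intermediate dictionary:
df = pd.DataFrame(values, columns=columns)
df.index = np.arange(1, len(df) + 1)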
For example, for a txt file of:
Math, Calculus, 5
Math, Vector, 3
Language, English, 4
Language, Spanish, 4
into the dictionary of:
data = {'Math': {'name': ['Calculus', 'Vector'], 'score': [5, 3]}, 'Language': {'name': ['English', 'Spanish'], 'score': [4, 4]}}
I am having trouble appending values to build the lists inside the inner dicts. I'm very new to this and don't really understand import commands. Thank you so much for all your help!
For each line, find the 3 values, then add them to the dict structure:
from pathlib import Path
result = {}
for row in Path("test.txt").read_text().splitlines():
    subject_type, subject, score = row.split(", ")
    if subject_type not in result:
        result[subject_type] = {'name': [], 'score': []}
    result[subject_type]['name'].append(subject)
    result[subject_type]['score'].append(int(score))
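With the sample file above, result ends up in the requested shape:
print(result)
# {'Math': {'name': ['Calculus', 'Vector'], 'score': [5, 3]},
#  'Language': {'name': ['English', 'Spanish'], 'score': [4, 4]}}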
You can simplify it with a defaultdict, which creates the mapping if the key isn't already present:
from collections import defaultdict
result = defaultdict(lambda: {'name': [], 'score': []})
for row in Path("test.txt").read_text().splitlines():
    subject_type, subject, score = row.split(", ")
    result[subject_type]['name'].append(subject)
    result[subject_type]['score'].append(int(score))
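If a plain dict is wanted at the end (for example, to match the target structure exactly), the defaultdict can simply be converted back:
result = dict(result)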
With pandas.DataFrame you can directly read the formatted data and output the format you want:
import pandas as pd
df = pd.read_csv("test.txt", sep=", ", engine="python", names=['key', 'name', 'score'])
df = df.groupby('key').agg(list)
result = df.to_dict(orient='index')
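With the same sample file, that should produce the nested dictionary in the desired shape, roughly:
print(result)
# {'Language': {'name': ['English', 'Spanish'], 'score': [4, 4]},
#  'Math': {'name': ['Calculus', 'Vector'], 'score': [5, 3]}}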
From your data:
data={'Math':{'name':['Calculus', 'Vector'], 'score':[5,3]},
'Language':{'name':['English', 'Spanish'], 'score':[4,4]}}
If you want to append to the list inside your dictionary, you can do:
data['Math']['name'].append('Algebra')
data['Math']['score'].append(4)
If you want to add a new dictionary, you can do:
data['Science'] = {'name':['Chemistry', 'Biology'], 'score':[2,3]}
I am not sure if that is what you wanted but I hope it helps!
I have a dataframe df
df
Object Action Cost1 Cost2
0 123 renovate 10000 2000
1 456 do something 0 10
2 789 review 1000 50
and a dictionary (called dictionary)
dictionary
{'Object_new': ['Object'],
'Action_new': ['Action'],
'Total_Cost': ['Cost1', 'Cost2']}
Further, I have an (initially empty) dataframe df_new that should contain almost identical information to df, except that the column names need to be different (named according to the dictionary) and some columns from df should be consolidated (e.g. by a sum operation) based on the dictionary.
The result should look like this:
df_new
Object_new Action_new Total_Cost
0 123 renovate 12000
1 456 do something 10
2 789 review 1050
How can I achieve this result using only the dictionary? I tried to use the .map() function but could not figure out how to perform the sum-operation with it.
The code to reproduce both dataframes and the dictionary is attached:
# import libraries
import pandas as pd
### create df
data_df = {'Object': [123, 456, 789],
'Action': ['renovate', 'do something', 'review'],
'Cost1': [10000, 0, 1000],
'Cost2': [2000, 10, 50],
}
df = pd.DataFrame(data_df)
### create dictionary
dictionary = {'Object_new':['Object'],
'Action_new':['Action'],
'Total_Cost' : ['Cost1', 'Cost2']}
### create df_new
# data_df_new = pd.DataFrame(columns=['Object_new', 'Action_new', 'Total_Cost' ])
data_df_new = {'Object_new': [123, 456, 789],
'Action_new': ['renovate', 'do something', 'review'],
'Total_Cost': [12000, 10, 1050],
}
df_new = pd.DataFrame(data_df_new)
A play with groupby:
inv_dict = {x:k for k,v in dictionary.items() for x in v}
df_new = df.groupby(df.columns.map(inv_dict), axis=1).sum()
Output:
Action_new Object_new Total_Cost
0 renovate 123 12000
1 do something 456 10
2 review 789 1050
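The column order follows the grouping rather than the dictionary; if the order matters, the columns can be reordered afterwards, e.g.:
df_new = df_new[list(dictionary)]  # ['Object_new', 'Action_new', 'Total_Cost']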
Given the complexity of your algorithm, I would suggest performing a Series addition operation to solve this problem.
Why? In Pandas, every column in a DataFrame works as a Series under the hood.
data_df_new = {
'Object_new': df['Object'],
'Action_new': df['Action'],
'Total_Cost': (df['Cost1'] + df['Cost2']) # Addition of two series
}
df_new = pd.DataFrame(data_df_new)
Running this code maps every column of the original dataset into the dictionary, which is then used to build the new DataFrame.
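A sketch that generalizes this so the new frame is driven purely by the dictionary (assuming every multi-column entry should be summed, as in the example):
df_new = pd.DataFrame({
    new_col: df[old_cols].sum(axis=1) if len(old_cols) > 1 else df[old_cols[0]]
    for new_col, old_cols in dictionary.items()
})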
You can use an empty dataframe, copy the new columns into it, and use to_dict to convert it to a dictionary.
import pandas as pd
import numpy as np
data_df = {'Object': [123, 456, 789],
'Action': ['renovate', 'do something', 'review'],
'Cost1': [10000, 0, 1000],
'Cost2': [2000, 10, 50],
}
df = pd.DataFrame(data_df)
print(df)
MyEmptydf = pd.DataFrame()
MyEmptydf['Object_new']=df['Object']
MyEmptydf['Action_new']=df['Action']
MyEmptydf['Total_Cost'] = df['Cost1'] + df['Cost2']
print(MyEmptydf)
dictionary = MyEmptydf.to_dict(orient="index")
print(dictionary)
you can run the code here:https://repl.it/repls/RealisticVillainousGlueware
If you are trying to entirely avoid pandas and only use the dictionary, this should solve it:
Object = []
totalcost = []
action = []
for i in range(0, 3):
    Object.append(data_df['Object'][i])
    totalcost.append(data_df['Cost1'][i] + data_df['Cost2'][i])
    action.append(data_df['Action'][i])
dict2 = {'Object': Object, 'Action': action, 'TotalCost': totalcost}
How can I add the outputs of different for loops into one dataframe? For example, I have scraped data from a website and have lists of names, emails and phone numbers from separate loops. I want to add all the outputs into a table in a single dataframe.
I am able to do it for one single loop but not for multiple loops.
Please look at the code and output in attached images.
If I remove zip from the for loop, it gives the error "too many values to unpack".
Loop
phone = soup.find_all(class_="directory_item_phone directory_item_info_item")
for phn in phone:
    print(phn.text.strip())
##Output - List of Numbers
Code for df
df = list()
for name, mail, phn in zip(faculty_name, email, phone):
    df.append(name.text.strip())
    df.append(mail.text.strip())
    df.append(phn.text.strip())
df = pd.DataFrame(df)
df
(Images: for loops; code and output for df)
An efficient way to create a pandas.DataFrame is to first create a dict and then convert it into a DataFrame.
In your case you could probably do:
import pandas as pd
D = {'name': [], 'mail': [], 'phone': []}
for name, mail, phn in zip(faculty_name, email, phone):
    D['name'].append(name.text.strip())
    D['mail'].append(mail.text.strip())
    D['phone'].append(phn.text.strip())
df = pd.DataFrame(D)
Another way, with a lambda function:
import pandas as pd
text_strip = lambda s : s.text.strip()
D = {
'name': list(map(text_strip, faculty_name)),
'mail': list(map(text_strip, email)),
'phone': list(map(text_strip, phone))
}
df = pd.DataFrame(D)
If the lists don't all have the same length, you may try this (but I am not sure it is very efficient):
import pandas as pd
columns_names = ['name', 'mail', 'phone']
all_lists = [faculty_name, email, phone]
max_length = max(map(len, all_lists))
D = {c_name: [None]*max_length for c_name in columns_names}
for c_name, l in zip(columns_names, all_lists):
    for ind, element in enumerate(l):
        D[c_name][ind] = element
df = pd.DataFrame(D)
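Alternatively, a sketch that leans on pandas itself: building the frame from Series of different lengths lets pandas pad the shorter ones with NaN (the .text.strip() call below assumes the same element objects as above):
D = {c_name: pd.Series([el.text.strip() for el in lst])
     for c_name, lst in zip(columns_names, all_lists)}
df = pd.DataFrame(D)  # shorter columns are padded with NaN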
Try this,
data = {'name':[name.text.strip() for name in faculty_name],
'mail':[mail.text.strip() for mail in email],
'phn':[phn.text.strip() for phn in phone],}
df = pd.DataFrame.from_dict(data)