I have the following dataframe:
  Region Name  Price
0     ny    A  53.00
1     ln    B  52.23
2     ln    B  51.20
3     tk    C  50.50
I want to convert the data to a list of lists, with the Name and Price fields combined into a dictionary.
The Name field is repeated, but I would like to get the unique values and then assign the prices to each key.
Something like this: [Region, {Name: Price}]
For example:
[[ny, {"A": array([53.00])}],[ln, {"B": array([52.23 , 51.20])}],[tk, {"C": array([50.50]}]]
Can anyone suggest a way to do this?
Thanks.
You could put Region and Name into a MultiIndex and export with to_dict, something like this:
import pandas as pd

df = pd.DataFrame({
    'Region': ['ny', 'ln', 'ln', 'tk'],
    'Name': ['A', 'B', 'B', 'C'],
    'Price': [53, 52, 51, 50]
})
# First, combine the values for each Region/Name pair into a list
df_grouped = df.groupby(['Region', 'Name']).Price.apply(list).to_frame()
# Second, create a nested dictionary
df_grouped.groupby(level=0).apply(
    lambda df: df.xs(df.name).to_dict()['Price']
).to_dict()
>>> {'ln': {'B': [52, 51]}, 'ny': {'A': [53]}, 'tk': {'C': [50]}}
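If you want the exact list-of-lists shape from the question, with NumPy arrays as the dictionary values, you could post-process that nested dict. A small sketch (the nested name is mine):
import numpy as np

nested = df_grouped.groupby(level=0).apply(
    lambda df: df.xs(df.name).to_dict()['Price']
).to_dict()
[[region, {name: np.array(prices) for name, prices in names.items()}]
 for region, names in nested.items()]
>>> [['ln', {'B': array([52, 51])}], ['ny', {'A': array([53])}], ['tk', {'C': array([50])}]]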
I am using a webservice that returns inference data about submitted images in the form of:
{'IMG_123.jpg' : [{'keyword': value, 'score': value}, {'keyword': value, 'score': value}, {'keyword': value, 'score': value}]}
Like this:
https://i.stack.imgur.com/FEDqU.png
I want to combine multiple queries into a single dataframe such that the columns are the names of the Images, and the indices are the "keyword" values with the datapoints being the value of the "score".
I have been able to transform the data into, I think, a more usable format using this code:
d = {}
for k, v in mydict.items():
    d[k] = [{i['keyword']: i['score']} for i in v]
print(pd.DataFrame(d['IMG_1221.JPG']).T)
But this returns: https://i.stack.imgur.com/c3R0l.png
I am not sure how to combine multiple images into the format I am looking for, and the above code does not format my columns in a useful way.
The service returns keyword values that are not consistent across all images, so the returned lists of dicts will differ in size and keys. I would like to have a NaN or 0 value for any key that does not exist for a given image but does for other images in the dataframe.
Any help is much appreciated!
IIUC, you want something like this:
import pandas as pd

mydict = {'IMG_1.JPG': [
    {'keyword': 'a', 'score': 1},
    {'keyword': 'b', 'score': 2},
    {'keyword': 'c', 'score': 3}]}
mydict2 = {'IMG_2.JPG': [
    {'keyword': 'a', 'score': 1},
    {'keyword': 'b', 'score': 2},
    {'keyword': 'd', 'score': 3}]}
mydicts = [mydict, mydict2]

df_all = pd.DataFrame()
for d in mydicts:
    # Each dict has a single key: the image name
    key = list(d.keys())[0]
    # One column per image, indexed by keyword
    df = pd.DataFrame(d[key]).set_index('keyword').rename(columns={'score': key})
    df_all = pd.concat([df_all, df], axis=1)
print(df_all)
         IMG_1.JPG  IMG_2.JPG
keyword
a              1.0        1.0
b              2.0        2.0
c              3.0        NaN
d              NaN        3.0
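One caveat: pd.concat inside the loop copies df_all on every iteration. As a variant under the same assumed input structure, you could build all the per-image frames first and concatenate once:
frames = [pd.DataFrame(v).set_index('keyword').rename(columns={'score': k})
          for d in mydicts
          for k, v in d.items()]
# Keywords missing for one image become NaN automatically
df_all = pd.concat(frames, axis=1)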
I am trying to create a dictionary from a DataFrame where the key sometimes has multiple values.
For example:
df

ID  value
A      10
B      45
C      20
C      30
D      20
E      10
E      70
E     110
F      20
And I want the dictionary to look like:
dic = {'A': 10,
       'B': 45,
       'C': [20, 30],
       'D': 20,
       'E': [10, 70, 110],
       'F': 20}
I tried using the following code:
dic = df.set_index('ID').T.to_dict('list')
But it returned a dictionary with only one value per ID:
{'A': 10,
'B': 45,
'C': 30,
'D': 20,
'E': 110,
'F': 20}
I'm assuming the right way to go about it is with some kind of loop appending to an empty dictionary, but I'm not sure what the proper syntax would be.
My actual DataFrame is much longer than this, so what would I use to convert the DataFrame to the dictionary?
Thanks!
Example dataframe:
df = pd.DataFrame({'ID': ['A', 'B', 'B'], 'value': [1, 2, 3]})
df_tmp = df.groupby('ID')['value'].apply(list).reset_index()
dict(zip(df_tmp['ID'], df_tmp['value']))
outputs
{'A': [1], 'B': [2, 3]}
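Note this wraps every value in a list, even for IDs that occur only once. If you want scalars for single-value keys, matching the desired output above, a small sketch that unwraps length-1 lists:
s = df.groupby('ID')['value'].apply(list)
# Keep the bare value for IDs with one row, the list otherwise
dic = {k: v[0] if len(v) == 1 else v for k, v in s.items()}
outputs
{'A': 1, 'B': [2, 3]}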
SOLVED:
# Split and save each unique part to a separate CSV
for unique_part in df['Part'].unique():
    # pass index=False to to_csv if you don't want the row index in each file
    df.loc[df['Part'] == unique_part].to_csv(f'Part_{unique_part}.csv')
I have a table containing production data on parts and the variables that were recorded during their production. I need to slice out all columns for each unique part's rows, i.e., all columns for parts #1, #2, and #3 should be sliced out and put into separate dataframes.
FORMAT:
Part | Variable1 | Variable2 | etc.
  1  |     X     |     X
  1  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  2  |     X     |     X
  3  |     X     |     X
  3  |     X     |     X
  3  |     X     |     X
I have already tried creating a dictionary to group by:
d = {k: v for k, v in df.groupby('Part')}
This didn't work because I couldn't properly convert from the dict back to DataFrames in the correct format.
I also tried creating a variable to store all the unique part numbers; I just don't know how to loop through the main dataframe to slice out each unique part's rows:
part_num = df['Part'].unique()
In summary, I need to create separate dataframes with all variable columns for each cluster of rows with unique part number ids.
You can groupby and then apply to turn each group into a list of dicts, then turn the result into a dict where each key is a unique Part value.
Something like:
df = pd.DataFrame({
    'Part': [1, 1, 1, 3, 3, 2, 2, 2],
    'other': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})
d = df.groupby('Part').apply(lambda g: g.to_dict('records')).to_dict()
print(d)
will print
{1: [{'Part': 1, 'other': 'a'},
{'Part': 1, 'other': 'b'},
{'Part': 1, 'other': 'c'}],
2: [{'Part': 2, 'other': 'f'},
{'Part': 2, 'other': 'g'},
{'Part': 2, 'other': 'h'}],
3: [{'Part': 3, 'other': 'd'}, {'Part': 3, 'other': 'e'}]}
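If what you actually want is a dict of sub-DataFrames rather than lists of dicts, the dict comprehension from the question already gives you that; each value is an ordinary DataFrame:
parts = {part: group for part, group in df.groupby('Part')}
print(parts[1])
will print
   Part other
0     1     a
1     1     b
2     1     c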
I think you are on the right track with groupby:
df = pd.DataFrame({"Part": [1, 1, 2, 2],
"Var1": [10, 11, 12, 13],
"Var2": [20, 21, 22, 23]})
dfg = df.groupby("Part")
df1 = dfg.get_group(1)
df2 = dfg.get_group(2)
What do you want to DO with the data? Do you really need to create a bunch of individual data frames? The example below loops through each group (each part #) and prints. You could use the same method to do something or get something from each group without creating individual data frames.
for grp in dfg.groups:
    print(dfg.get_group(grp))
    print()
Output:
   Part  Var1  Var2
0     1    10    20
1     1    11    21

   Part  Var1  Var2
2     2    12    22
3     2    13    23
Let's say I have an Excel sheet like this:
If I read this file into pandas, I get Column1, Column2, Column3 as headers.
However, I want to know/create an output, possibly a dictionary, like this:
{Column1: 'A', Column2: 'B', Column3: 'C'}
The reason is that I have another dictionary from a master mapping file (where the references for each column were already done manually) that has all the references to each column, like this:
{Column1: 'A', Column2: 'B', Column3: 'C', Column4: 'D'}
This way, I can cross-check keys and values, and if there is any mismatch, I can identify it. How can I get the original column letter, such as A for Column1, while reading a file into pandas? Any ideas?
You can use dict with zip to map column names to letters. Assumes you have a maximum of 26 columns.
import numpy as np
import pandas as pd
from string import ascii_uppercase

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  columns=['Column1', 'Column2', 'Column3'])
d = dict(zip(df.columns, ascii_uppercase))
print(d)
{'Column1': 'A', 'Column2': 'B', 'Column3': 'C'}
For more than 26 columns, you can adapt the itertools.product solution available in Repeating letters like excel columns?
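A minimal sketch of that adaptation, generating Excel-style letters (A ... Z, AA, AB, ...) for any number of columns (the helper name excel_letters is mine; df as above):
from itertools import count, product
from string import ascii_uppercase

def excel_letters():
    # Yields 'A', 'B', ..., 'Z', 'AA', 'AB', ... indefinitely
    for n in count(1):
        for letters in product(ascii_uppercase, repeat=n):
            yield ''.join(letters)

d = dict(zip(df.columns, excel_letters()))  # zip stops at the number of columns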
You can use the pandas rename method to replace the dataframe column names using your existing mapping dictionary:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html
import pandas as pd
df = pd.DataFrame({'Column1': [1, 2], 'Column2': [3, 4], 'Column3': [5, 6]})
existing_mapping = {'Column1': 'A', 'Column2': 'B', 'Column3': 'C', 'Column4': 'D'}
df = df.rename(columns=existing_mapping)
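For the cross-checking part of the question, a small sketch (variable names are mine) that compares a mapping generated from the file's columns, built before any renaming, against the master mapping:
from string import ascii_uppercase

generated = dict(zip(['Column1', 'Column2', 'Column3'], ascii_uppercase))
master = {'Column1': 'A', 'Column2': 'B', 'Column3': 'C', 'Column4': 'D'}
# Keys whose letter differs between the two mappings, or is missing from one
mismatches = {k: (generated.get(k), v) for k, v in master.items()
              if generated.get(k) != v}
print(mismatches)
{'Column4': (None, 'D')}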
I'm looking to fill in missing values of one column with the mode of the value from another column. Let's say this is our data set (borrowed from Chris Albon):
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Jake', 'Jake', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Smith', 'Ali', 'Milner', 'Cooze'],
            'age': [42, np.nan, 36, 24, 73],
            'sex': ['m', np.nan, 'f', 'm', 'f'],
            'preTestScore': [4, np.nan, np.nan, 2, 3],
            'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df
I know we can fill in missing postTestScore with each sex's mean value of postTestScore with:
df["postTestScore"].fillna(df.groupby("sex")["postTestScore"].transform("mean"), inplace=True)
df
But how would we fill in missing sex with each first name's mode value of sex (obviously this is not politically correct, but as an example this was an easy data set to use)? For this example, the missing sex value would be 'm' because there are two Jakes with the value 'm'. If there were also a Jake with the value 'f', it would still pick 'm' as the mode because 2 > 1. It would be nice if you could do:
df["sex"].fillna(df.groupby("first_name")["sex"].transform("mode"), inplace=True)
df
I looked into value_counts and apply but couldn't find this specific case. My ultimate goal is to be able to look at one column, and if that doesn't have a mode value, to look at another column for one.
You need to call transform with pd.Series.mode:
df.groupby("first_name")["sex"].transform(pd.Series.mode)
Out[432]:
0 m
1 m
2 f
3 m
4 f
Name: sex, dtype: object
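One caveat: pd.Series.mode can return several values for a multimodal group (and an empty Series for an all-NaN group), which can break transform in some pandas versions. To guard against that and actually fill the missing values, a sketch (the helper name first_mode is mine):
import numpy as np

def first_mode(s):
    # Return the group's first mode, or NaN if the group has no mode
    m = s.mode()
    return m.iat[0] if not m.empty else np.nan

df['sex'] = df['sex'].fillna(df.groupby('first_name')['sex'].transform(first_mode))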