I have a .csv file with many rows and 3 columns: Date, Rep, and Sales. I would like to use Python to generate a new array that groups the data by Date and, for the given date, sorts the Reps by Sales. As an example, my input data looks like this:
salesData = [[201703,'Bob',3000], [201703,'Sarah',6000], [201703,'Jim',9000],
[201704,'Bob',8000], [201704,'Sarah',7000], [201704,'Jim',12000],
[201705,'Bob',15000], [201705,'Sarah',14000], [201705,'Jim',8000],
[201706,'Bob',10000], [201706,'Sarah',18000]]
My desired output would look like this:
sortedData = [[201703,'Jim', 'Sarah', 'Bob'], [201704,'Jim', 'Bob',
'Sarah'], [201705,'Bob', 'Sarah', 'Jim'], [201706, 'Sarah', 'Bob']]
I am new to Python, but I have searched quite a bit for a solution with no success. Most of my search results lead me to believe there may be an easy way to do this using pandas (which I have not used) or numpy (which I have used).
Any suggestions would be greatly appreciated. I am using Python 3.6.
Use Pandas!
import pandas as pd
salesData = [[201703, 'Bob', 3000], [201703, 'Sarah', 6000], [201703, 'Jim', 9000],
[201704, 'Bob', 8000], [201704, 'Sarah', 7000], [201704, 'Jim', 12000],
[201705, 'Bob', 15000], [201705, 'Sarah', 14000], [201705, 'Jim', 8000],
[201706, 'Bob', 10000], [201706, 'Sarah', 18000]]
sales_df = pd.DataFrame(salesData)
result = []
for name, group in sales_df.groupby(0):
    sorted_df = group.sort_values(2, ascending=False)
    result.append([name] + list(sorted_df[1]))
print(result)
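Since the data starts out as a .csv file, the same idea can be sketched with named columns instead of positional ones. This is only an illustration: the inline csv_text stands in for the real file (its path is not given in the question), and the column names Date, Rep, and Sales are taken from the question's description.

```python
import io

import pandas as pd

# Stand-in for the real .csv file (assumption: a header row with these names).
csv_text = """Date,Rep,Sales
201703,Bob,3000
201703,Sarah,6000
201703,Jim,9000
201704,Bob,8000
201704,Sarah,7000
201704,Jim,12000
201705,Bob,15000
201705,Sarah,14000
201705,Jim,8000
201706,Bob,10000
201706,Sarah,18000
"""

sales_df = pd.read_csv(io.StringIO(csv_text))

# One entry per date: the date followed by the reps, highest sales first.
sorted_data = [
    [int(date)] + group.sort_values("Sales", ascending=False)["Rep"].tolist()
    for date, group in sales_df.groupby("Date")
]
print(sorted_data)
```

With a real file, pd.read_csv would be given the file path instead of the io.StringIO wrapper.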
Without pandas, you can try this one line answer:
sortedData = [[i]+[item[1] for item in salesData if item[0]==i] for i in sorted(set([item[0] for item in salesData]))]
EDIT:
You can do this to order each inner list by sales:
sortedData = [[i]+[item[1] for item in sorted(salesData, key=lambda x: -x[2]) if item[0]==i] for i in sorted(set([item[0] for item in salesData]))]
Note that the sorted(salesData, key=lambda x: -x[2]) part performs the ordering by sales, highest first.
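If the one-liner is hard to read, the same result can be sketched in two steps with a dictionary: group the rows by date first, then sort each group. The names by_date and pairs are only illustrative; this version also avoids re-sorting the whole list once per date.

```python
from collections import defaultdict

salesData = [[201703, 'Bob', 3000], [201703, 'Sarah', 6000], [201703, 'Jim', 9000],
             [201704, 'Bob', 8000], [201704, 'Sarah', 7000], [201704, 'Jim', 12000],
             [201705, 'Bob', 15000], [201705, 'Sarah', 14000], [201705, 'Jim', 8000],
             [201706, 'Bob', 10000], [201706, 'Sarah', 18000]]

# Step 1: collect (sales, rep) pairs under each date.
by_date = defaultdict(list)
for date, rep, sales in salesData:
    by_date[date].append((sales, rep))

# Step 2: for each date in ascending order, list the reps by descending sales.
sortedData = [
    [date] + [rep for sales, rep in sorted(pairs, key=lambda p: p[0], reverse=True)]
    for date, pairs in sorted(by_date.items())
]
print(sortedData)
```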
I have a dictionary like so: {key_1: pd.DataFrame, key_2: pd.DataFrame, ...}.
Each of the DataFrames in the dictionary has a column called 'ID'.
Not every ID appears in each DataFrame, so the DataFrames are of different sizes.
Is there any way I could combine these into one large DataFrame?
Here's a minimal reproducible example of the data:
data1 = [{'ID': 's1', 'country': 'Micronesia', 'Participants':3},
{'ID':'s2', 'country': 'Thailand', 'Participants': 90},
{'ID':'s3', 'country': 'China', 'Participants': 36},
{'ID':'s4', 'country': 'Peru', 'Participants': 30}]
data2 = [{'ID': '1', 'country': 'Micronesia', 'Kids_per_participant':3},
{'ID':'s2', 'country': 'Thailand', 'Kids_per_participant': 9},
{'ID':'s3', 'country': 'China', 'Kids_per_participant': 39}]
data3= [{'ID': 's1', 'country': 'Micronesia', 'hair_style_rank':3},
{'ID':'s2', 'country': 'Thailand', 'hair_style_rank': 9}]
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
dict_example={'df1_key':df1,'df2_key':df2,'df3_key':df3}
pd.merge(dict_example.values(), on="ID", how="outer")
For a dict with an arbitrary number of keys you could do this:
i=list(dict_example.keys())
newthing = dict_example[i[0]]
for j in range(1,len(i)):
    newthing = newthing.merge(dict_example[i[j]],on='ID', how = 'outer')
First make a list of the dictionary keys. Then take the first DataFrame as the starting point, iterate through the remaining DataFrames, and merge each one into it. I did notice you have country for each ID, but it is not listed in your initial on argument. Do you want to join on country as well? If so, replace the merge above with this one, which changes the join criteria to a list that includes country:
newthing = newthing.merge(dict_example[i[j]],on=['ID','country'], how = 'outer')
See the pandas documentation on merge.
If you don't care about altering your DataFrames, the code could be shorter, like this:
for j in range(1,len(i)):
    df1 = df1.merge(dict_example[i[j]],on=['ID','country'], how = 'outer')
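The loop over the dictionary can also be written as a fold with functools.reduce. This is a sketch of the same outer merge on ID and country, not a different technique; the name merged is illustrative, and the data is the example from the question.

```python
from functools import reduce

import pandas as pd

data1 = [{'ID': 's1', 'country': 'Micronesia', 'Participants': 3},
         {'ID': 's2', 'country': 'Thailand', 'Participants': 90},
         {'ID': 's3', 'country': 'China', 'Participants': 36},
         {'ID': 's4', 'country': 'Peru', 'Participants': 30}]
data2 = [{'ID': '1', 'country': 'Micronesia', 'Kids_per_participant': 3},
         {'ID': 's2', 'country': 'Thailand', 'Kids_per_participant': 9},
         {'ID': 's3', 'country': 'China', 'Kids_per_participant': 39}]
data3 = [{'ID': 's1', 'country': 'Micronesia', 'hair_style_rank': 3},
         {'ID': 's2', 'country': 'Thailand', 'hair_style_rank': 9}]

dict_example = {'df1_key': pd.DataFrame(data1),
                'df2_key': pd.DataFrame(data2),
                'df3_key': pd.DataFrame(data3)}

# Merge the DataFrames pairwise, left to right, keeping every ID/country pair.
merged = reduce(
    lambda left, right: left.merge(right, on=['ID', 'country'], how='outer'),
    dict_example.values(),
)
print(merged)
```

Rows that are missing from one of the inputs get NaN in that input's columns, so the original DataFrames are left untouched and only merged holds the combined result.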
All I want to do is populate a list of unique customers with their corresponding year of birth, mostly so that I can write it back to my df to fill the empty spaces. Both John's and Mike's names appear twice on the list. John provided his year of birth the first time he purchased an item but did not do so the second time, whereas Mike did the opposite. Below is a sample of my dataframe of customer transactions.
df = pd.DataFrame({ 'Date': ['2020-06-01', '2020-06-01', '2020-06-01', '2020-06-19', '2020-06-20', '2020-06-22',
'2020-06-24', '2020-06-25'],
'cst_names': ['John', 'Mike', 'Ndara', 'John', 'Kasiku', 'Mike', 'Alter', 'Lee'],
'birth_year': [1979, '', 1977, '', 1980, 1986, 1986, 2000],
'Price': [2000, 300, 375, 800, 3000, 199, 250, 600] })
This is what I want to achieve:
unique_lst = {'John': 1979, 'Mike': 1986, 'Ndara': 1977, 'Kasiku': 1980, 'Alter': 1986, 'Lee':2000 }
Once I have this info, I want to write it back to my df and update the missing spaces
I tried using zip and set but I don't seem to get it right.
a_dict = dict(zip(df.cst_names, df.birth_year))
I tried a for loop and a tuple, but I still can't figure it out.
I tried dropping the rows with an empty birth year first, then zipped.
I hope it works for you.
df_altered = df.drop(df[df['birth_year']==''].index)
a_dict = dict(zip(df_altered.cst_names, df_altered.birth_year))
a_dict
I did the following and it works, but I think it is quite dirty. In the first part I set every name's birth year to "" by default, and then I update it with the birth year I found. If you only want the non-null birth years, then the other answer you received is good.
dict_ = {}
list_names = df.cst_names.unique()
for name in list_names:
    dict_[name]=""
df = df[df["birth_year"]!=""]
for name in df.cst_names.unique():
    dict_[name]= df.loc[df["cst_names"]==name, "birth_year"].values[0]
An apply version
result_dict=df[df['birth_year']!=''].groupby('cst_names').apply(lambda row: row['birth_year']).reset_index()[['cst_names','birth_year']].set_index('cst_names').T.to_dict('list')
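None of the snippets above show the second half of the question, writing the values back to the df. A minimal sketch of both steps, assuming missing birth years are stored as empty strings as in the sample and that each customer has at most one distinct birth year; the names known and birth_by_name are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2020-06-01', '2020-06-01', '2020-06-01', '2020-06-19',
             '2020-06-20', '2020-06-22', '2020-06-24', '2020-06-25'],
    'cst_names': ['John', 'Mike', 'Ndara', 'John', 'Kasiku', 'Mike', 'Alter', 'Lee'],
    'birth_year': [1979, '', 1977, '', 1980, 1986, 1986, 2000],
    'Price': [2000, 300, 375, 800, 3000, 199, 250, 600],
})

# Keep only rows where a birth year was provided, then build name -> year.
known = df[df['birth_year'] != '']
birth_by_name = dict(zip(known['cst_names'], known['birth_year']))

# Write the mapping back: every row gets its customer's known birth year.
df['birth_year'] = df['cst_names'].map(birth_by_name)
print(birth_by_name)
print(df)
```

A customer who never provided a birth year would be absent from birth_by_name, and map would leave NaN in that row.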
I've got a json format list with some dictionaries within each list, it looks like the following:
[{"id":13, "name":"Albert", "venue":{"id":123, "town":"Birmingham"}, "month":"February"},
{"id":17, "name":"Alfred", "venue":{"id":456, "town":"London"}, "month":"February"},
{"id":20, "name":"David", "venue":{"id":14, "town":"Southampton"}, "month":"June"},
{"id":17, "name":"Mary", "venue":{"id":56, "town":"London"}, "month":"December"}]
The number of entries within the list can be up to 100. I plan to present the 'name' for each entry, one result at a time, for those that have London as a town. The rest are of no use to me. I'm a beginner at Python, so I would appreciate a suggestion on how to go about this efficiently. I initially thought it would be best to remove all entries that don't have London, and then I can go through them one by one.
I also wondered if it might be quicker to not filter but to cycle through the entire json and select the names of entries that have the town as London.
You can use filter:
data = [{"id":13, "name":"Albert", "venue":{"id":123, "town":"Birmingham"}, "month":"February"},
{"id":17, "name":"Alfred", "venue":{"id":456, "town":"London"}, "month":"February"},
{"id":20, "name":"David", "venue":{"id":14, "town":"Southampton"}, "month":"June"},
{"id":17, "name":"Mary", "venue":{"id":56, "town":"London"}, "month":"December"}]
london_dicts = filter(lambda d: d['venue']['town'] == 'London', data)
for d in london_dicts:
    print(d)
This is about as efficient as it gets because:
The iteration itself runs in C (in the case of CPython), although the lambda is still called once per item
filter returns an iterator (in Python 3), which means that the results are produced one by one as required rather than all being held in memory at once
One way is to use a list comprehension:
>>> data = [{"id":13, "name":"Albert", "venue":{"id":123, "town":"Birmingham"}, "month":"February"},
{"id":17, "name":"Alfred", "venue":{"id":456, "town":"London"}, "month":"February"},
{"id":20, "name":"David", "venue":{"id":14, "town":"Southampton"}, "month":"June"},
{"id":17, "name":"Mary", "venue":{"id":56, "town":"London"}, "month":"December"}]
>>> [d for d in data if d['venue']['town'] == 'London']
[{'id': 17,
'name': 'Alfred',
'venue': {'id': 456, 'town': 'London'},
'month': 'February'},
{'id': 17,
'name': 'Mary',
'venue': {'id': 56, 'town': 'London'},
'month': 'December'}]
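Since the question asks only for the names, presented one at a time, either approach can be narrowed to yield just the 'name' field. A small sketch using a generator expression; the variable names london_names and names are illustrative:

```python
data = [{"id": 13, "name": "Albert", "venue": {"id": 123, "town": "Birmingham"}, "month": "February"},
        {"id": 17, "name": "Alfred", "venue": {"id": 456, "town": "London"}, "month": "February"},
        {"id": 20, "name": "David", "venue": {"id": 14, "town": "Southampton"}, "month": "June"},
        {"id": 17, "name": "Mary", "venue": {"id": 56, "town": "London"}, "month": "December"}]

# Lazily yields one name at a time for entries whose venue is in London.
london_names = (d["name"] for d in data if d["venue"]["town"] == "London")

names = []
for name in london_names:
    print(name)         # present one result at a time
    names.append(name)  # kept only so the results can be inspected afterwards
```

For a list of up to 100 entries, the difference between filtering first and filtering while looping is negligible, so readability is the better guide.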
I have a set:
CompanyList={'Apple','LG','Samsung'}
and a pandas DataFrame:
sales=[{'name':'Samsung Korea','model':'S1'},
{'name':'Samsung Vienam','model':'J1'},
{'name':'LG America','model':'L1'}
]
df=pd.DataFrame(sales)
I'd like to go through the CompanyList, then generate new Sub-DataFrame from 'sales' DataFrame. The expected results are
dataSamsung = [{'name': 'Samsung', 'model': 'S1'},{'name': 'Samsung', 'model': 'J1'}]
dataLG = [{'name': 'LG', 'model': 'L1'}]
I tried:
customer={}
for i in CompanyList:
    customer[i] = df[df.name.str.contains('i')]
but this gives me a wrong answer. Could you help me to fix this case?
Try apply:
df['name']=df['name'].apply(lambda x: [i for i in CompanyList if i in x][0])
apply with list comprehension.
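The apply line normalizes the name column but does not split the data per company. A sketch of building one sub-DataFrame per company, assuming a plain substring match is acceptable; the key point is passing the loop variable itself to str.contains rather than the literal string 'i':

```python
import pandas as pd

CompanyList = {'Apple', 'LG', 'Samsung'}
sales = [{'name': 'Samsung Korea', 'model': 'S1'},
         {'name': 'Samsung Vienam', 'model': 'J1'},
         {'name': 'LG America', 'model': 'L1'}]
df = pd.DataFrame(sales)

customer = {}
for company in CompanyList:
    # Match on the company name held in the variable, as a literal substring.
    mask = df['name'].str.contains(company, regex=False)
    # Replace the full name with the bare company name, as in the expected output.
    customer[company] = df[mask].assign(name=company)

dataSamsung = customer['Samsung'].to_dict('records')
dataLG = customer['LG'].to_dict('records')
print(dataSamsung)
print(dataLG)
```

Companies with no matching rows, such as Apple here, end up with an empty DataFrame.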
I have a dataframe, where I want a unique id (rec_id) for every record.
Something like
[Image: picture of the troublesome df]
I have been experimenting with rec_id=df.index, but the index was not unique.
I have tried to reset it with df.reset_index(),
but that did not work either.
Any suggestions are warmly welcomed.
BR Lasse
Try this:
import numpy as np
ds = ds.assign(rec_id=np.arange(len(ds))).reset_index(drop=True)
Maybe something like this
import pandas as pd
data = {'name': ['Jova', 'Mimi', 'Taty', 'Jessica', 'Alex'],
'year': [2012, 2012, 2013, 2014, 2014],
'docs': [40, 24, 19, 2, 3]}
df = pd.DataFrame(data, index = ['bg', 'ny', 'sd', 'sp', 'la'])
print (df)
print (df.name.unique())
I solved it like this for lack of a prettier solution.
colle=ds.columns
ds=ds.values
ds=pd.DataFrame(ds)
ds.columns=colle
ds['rec_id']=ds.index
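A shorter route to the same result, sketched on made-up data with a deliberately non-unique index: reset_index(drop=True) replaces the index with 0..n-1, but it returns a new DataFrame, so the result has to be assigned back. Forgetting that assignment is one possible reason the earlier reset_index attempt appeared not to work, though the question does not confirm it.

```python
import pandas as pd

# Illustrative data; the index labels repeat on purpose.
df = pd.DataFrame({'name': ['Jova', 'Mimi', 'Taty', 'Jessica', 'Alex']},
                  index=['bg', 'ny', 'bg', 'sp', 'ny'])

# Replace the index with a fresh 0..n-1 range and keep the returned DataFrame.
df = df.reset_index(drop=True)

# The new index is unique, so it can serve directly as the record id.
df['rec_id'] = df.index
print(df)
```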