How can I work with a dictionary of DataFrames, please? Or is there a better way to get an overview of my data? For example, if I have:
Fruit Qty Year
Apple 2 2016
Orange 1 2017
Mango 2 2016
Apple 9 2016
Orange 8 2015
Mango 7 2016
Apple 6 2016
Orange 5 2017
Mango 4 2015
Then I am trying to find the total quantity per fruit per year, for example:
2015 2016 2017
Apple 0 17 0
Orange 8 0 6
Mango 4 9 0
I have written some code but it might not be useful:
import pandas as pd
# Fruit Data
df_1 = pd.DataFrame({'Fruit':['Apple','Orange','Mango','Apple','Orange','Mango','Apple','Orange','Mango'], 'Qty': [2,1,2,9,8,7,6,5,4], 'Year': [2016,2017,2016,2016,2015,2016,2016,2017,2015]})
# Create a list of Fruits
Fruits = df_1.Fruit.unique()
# Break down the dataframe by Year
df_2015 = df_1[df_1['Year'] == 2015]
df_2016 = df_1[df_1['Year'] == 2016]
df_2017 = df_1[df_1['Year'] == 2017]
# Create a dataframe dictionary of Fruits
Dict_2015 = {elem : pd.DataFrame for elem in Fruits}
Dict_2016 = {elem : pd.DataFrame for elem in Fruits}
Dict_2017 = {elem : pd.DataFrame for elem in Fruits}
# Store the Qty for each Fruit x each Year
for Fruit in Dict_2015.keys():
    Dict_2015[Fruit] = df_2015[:][df_2015.Fruit == Fruit]
for Fruit in Dict_2016.keys():
    Dict_2016[Fruit] = df_2016[:][df_2016.Fruit == Fruit]
for Fruit in Dict_2017.keys():
    Dict_2017[Fruit] = df_2017[:][df_2017.Fruit == Fruit]
You can use pandas.pivot_table.
import numpy as np

res = df_1.pivot_table(index='Fruit', columns=['Year'], values='Qty',
                       aggfunc=np.sum, fill_value=0)
print(res)
Year 2015 2016 2017
Fruit
Apple 0 17 0
Mango 4 9 0
Orange 8 0 6
For guidance on usage, see How to pivot a dataframe.
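If you'd rather avoid the NumPy import, pandas.crosstab gives the same table; a minimal sketch on the same df_1:
# Cross-tabulation: rows are fruits, columns are years, cells are summed Qty
res = pd.crosstab(index=df_1['Fruit'], columns=df_1['Year'],
                  values=df_1['Qty'], aggfunc='sum').fillna(0).astype(int)
print(res)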
jpp has already posted an answer in the format you wanted. However, since your question sounded like you are open to other views, I thought I'd share another way. It is not exactly the format you posted, but this is how I usually do it:
df = df_1.groupby(['Fruit', 'Year']).agg({'Qty': 'sum'}).reset_index()
This will look something like:
    Fruit  Year  Qty
0   Apple  2016   17
1   Mango  2015    4
2   Mango  2016    9
3  Orange  2015    8
4  Orange  2017    6
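If you do want the wide layout from the question (years as columns, zeros for missing combinations), one way to get there from the same grouping (a sketch, not part of the original answer) is to unstack before resetting the index:
wide = (df_1.groupby(['Fruit', 'Year'])['Qty'].sum()
            .unstack(fill_value=0))  # years become columns, missing pairs become 0
print(wide)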
I have a pandas data frame that is grouped by date and a particular category and has the sum of another column. What I would like to do is take the number for a particular category on a particular day, add it to the next day's number, and then carry that running total forward to the day after that. For example, say the category is apples, the date is 5-26-2021 and the cost is $5, and the next day, 5-27-2021, costs $6; then 5-27-2021 should show a cost of $11. Then 5-28-2021 has a cost of $3, which should be added to the $11, so its cost should show up as $14. How can I go about doing this? There are multiple categories besides just apples. Thank you!
Expected output: a running (cumulative) cost per category and date (the example data above is not the most accurate, so feel free to ask questions).
Use groupby, then cumsum:
import pandas as pd

data = [
[2021, 'apple', 1,],
[2022, 'apple', 2,],
[2021, 'banana', 3,],
[2022, 'cherry', 4],
[2022, 'banana', 5],
[2023, 'cherry', 6],
]
columns = ['date','category', 'cost']
df = pd.DataFrame(data, columns=columns)
>>> df
date category cost
0 2021 apple 1
1 2022 apple 2
2 2021 banana 3
3 2022 cherry 4
4 2022 banana 5
5 2023 cherry 6
df.sort_values(['category','date'], inplace=True)
df.reset_index(drop=True, inplace=True)
df['CostCsum'] = df.groupby(['category'])['cost'].cumsum()
date category cost CostCsum
0 2021 apple 1 1
1 2022 apple 2 3
2 2021 banana 3 3
3 2022 banana 5 8
4 2022 cherry 4 4
5 2023 cherry 6 10
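The dates in the question look like real calendar dates (5-26-2021) rather than the plain years used in this toy example. If your date column holds such strings, parse it first so the sort is chronological rather than lexicographic; a sketch of that tweak, assuming the same column names as above:
# Hypothetical: parse month-day-year strings before sorting and accumulating
df['date'] = pd.to_datetime(df['date'], format='%m-%d-%Y')
df = df.sort_values(['category', 'date']).reset_index(drop=True)
df['CostCsum'] = df.groupby('category')['cost'].cumsum()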
This is part 2 of the original problem.
Is there a way to find the IDs that have both Apple and Strawberry (and count them), the IDs that have only Apple, and the IDs that have only Strawberry, broken down by location?
df:
ID Fruit Location
0 ABC Apple NY <-ABC has Apple and Strawberry
1 ABC Strawberry NY <-ABC has Apple and Strawberry
2 EFG Apple LA <-EFG has Apple only
3 XYZ Apple HOUSTON <-XYZ has Apple and Strawberry
4 XYZ Strawberry HOUSTON <-XYZ has Apple and Strawberry
5 CDF Strawberry BOSTON <-CDF has Strawberry
6 AAA Apple CHICAGO <-AAA has Apple only
Desired output:
IDs that have Apple and Strawberry:
NY 1
HOUSTON 1
IDs that have Apple only:
LA 1
CHICAGO 1
IDs that have Strawberry only:
BOSTON 1
The previous code was:
v = ['Apple','Strawberry']
out = df.groupby('ID')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print(out)
>>> 2
I tried the following but it did not work and gave me the same results
v = ['Apple','Strawberry']
out = df.groupby('ID', 'LOCATION')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print(out)
>>> 2
Thanks!
An (admittedly inefficient) solution using groupby and agg:
# Collapse each ID to the tuple of its fruits plus its (first) location
x = df.groupby('ID').agg({'Fruit': lambda f: tuple(f), 'Location': 'first'})
# Count locations for each distinct fruit combination
y = x.groupby('Fruit')['Location'].value_counts()
y:
Fruit Location
(Apple,) CHICAGO 1
LA 1
(Apple, Strawberry) HOUSTON 1
NY 1
(Strawberry,) BOSTON 1
Name: Location, dtype: int64
for index in set(y.index.get_level_values(0)):
    if len(index) == 2:
        print(f"IDs that have {index[0]} and {index[1]}:")
        print(y.loc[index].to_string())
    else:
        print(f"IDs that have {index[0]} only:")
        print(y.loc[index].to_string())
IDs that have Apple only:
Location
CHICAGO 1
LA 1
IDs that have Apple and Strawberry:
Location
HOUSTON 1
NY 1
IDs that have Strawberry only:
Location
BOSTON 1
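A more direct route (a sketch, not from the answer above) is to reduce each (Location, ID) pair to the set of fruits it has, which also ignores row order, and then count per location:
# Each (Location, ID) pair reduced to the frozenset of its fruits
combo = df.groupby(['Location', 'ID'])['Fruit'].apply(frozenset)
# Boolean masks for the three cases, then counts per location
has_both = combo.apply(lambda s: s == {'Apple', 'Strawberry'})
apple_only = combo.apply(lambda s: s == {'Apple'})
strawberry_only = combo.apply(lambda s: s == {'Strawberry'})
print(combo[has_both].groupby(level='Location').size())
print(combo[apple_only].groupby(level='Location').size())
print(combo[strawberry_only].groupby(level='Location').size())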
I've searched for an answer to the following question but haven't found one yet. I have a large dataset, like this small example:
df =
A B
1 I bought 3 apples in 2013
3 I went to the store in 2020 and got milk
1 In 2015 and 2019 I went on holiday to Spain
2 When I was 17, in 2014 I got a new car
3 I got my present in 2018 and it broke down in 2019
What I would like is to extract all the year values greater than 1950 and have this as the end result:
A B C
1 I bought 3 apples in 2013 2013
3 I went to the store in 2020 and got milk 2020
1 In 2015 and 2019 I went on holiday to Spain 2015_2019
2 When I was 17, in 2014 I got a new car 2014
3 I got my present in 2018 and it broke down in 2019 2018_2019
I tried to extract values first, but didn't get further than:
df["C"] = df["B"].str.extract('(\d+)').astype(int)
df["C"] = df["B"].apply(lambda x: re.search(r'\d+', x).group())
But all I get are error messages (I've only started with Python and working with text a few weeks ago). Could someone help me?
With a single regex pattern (considering your comment "need the year it took place"):
In [267]: import re

In [268]: pat = re.compile(r'\b(19(?:[6-9]\d|5[1-9])|[2-9]\d{3})')
In [269]: df['C'] = df['B'].apply(lambda x: '_'.join(pat.findall(x)))
In [270]: df
Out[270]:
A B C
0 1 I bought 3 apples in 2013 2013
1 3 I went to the store in 2020 and got milk 2020
2 1 In 2015 and 2019 I went on holiday to Spain 2015_2019
3 2 When I was 17, in 2014 I got a new car 2014
4 3 I got my present in 2018 and it broke down in ... 2018_2019
Here's one way, using str.findall and joining the items from the resulting lists that are greater than 1950:
s = df["B"].str.findall(r'\d+')
df['C'] = s.apply(lambda x: '_'.join(i for i in x if int(i) > 1950))
A B C
0 1 I bought 3 apples in 2013 2013
1 3 I went to the store in 2020 and got milk 2020
2 1 In 2015 and 2019 I went on holiday to Spain 2015_2019
3 2 When I was 17, in 2014 I got a new car 2014
4 3 I got my present in 2018 and it broke down in ... 2018_2019
Here's my dataset (only one column):
Apr 1 09:14:55 i have apple
Apr 2 08:10:10 i have mango
This is the result I need:
month date time message
Apr 1 09:14:55 i have apple
Apr 2 08:10:10 i have mango
This is what I've done:
import pandas as pd

month = []
date = []
time = []
message = []

for line in dns_data:
    month.append(line.split()[0])
    date.append(line.split()[1])
    time.append(line.split()[2])
    # note: nothing is ever appended to message here

df = pd.DataFrame(data={'month': month, 'date': date, 'time': time})
This is the output I get:
month date time
0 Apr 1 09:14:55
1 Apr 2 08:10:10
How can I also get the message column?
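For reference, the smallest change to the loop above is to split each line at most three times so the remainder survives as the message; a sketch, reusing the same dns_data iterable:
month, date, time, message = [], [], [], []
for line in dns_data:
    parts = line.split(maxsplit=3)  # e.g. ['Apr', '1', '09:14:55', 'i have apple']
    month.append(parts[0])
    date.append(parts[1])
    time.append(parts[2])
    message.append(parts[3])
df = pd.DataFrame({'month': month, 'date': date, 'time': time, 'message': message})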
Use the n parameter of Series.str.split to split on only the first 3 whitespaces; expand=True returns a DataFrame:
print (df)
col
0 Apr 1 09:14:55 i have apple
1 Apr 2 08:10:10 i have mango
df1 = df['col'].str.split(n=3, expand=True)
df1.columns=['month','date','time','message']
print (df1)
month date time message
0 Apr 1 09:14:55 i have apple
1 Apr 2 08:10:10 i have mango
Another solution with list comprehension:
c = ['month','date','time','message']
df1 = pd.DataFrame([x.split(maxsplit=3) for x in df['col']], columns=c)
print (df1)
month date time message
0 Apr 1 09:14:55 i have apple
1 Apr 2 08:10:10 i have mango
You could use Series.str.extractall with a regex pattern:
df = pd.DataFrame({'text': {0: 'Apr 1 09:14:55 i have apple', 1: 'Apr 2 08:10:10 i have mango'}})
df_new = (df.text.str
.extractall(r'^(?P<month>\w{3})\s?(?P<date>\d{1,2})\s?(?P<time>\d{2}:\d{2}:\d{2})\s?(?P<message>.*)$')
.reset_index(drop=True))
print(df_new)
month date time message
0 Apr 1 09:14:55 i have apple
1 Apr 2 08:10:10 i have mango
This regex may help you. Note that it uses Ruby-style named groups, matching the Rubular link below; in Python's re module they would be written as (?P<Month>...).
(?<Month>\w+)\s(?<Date>\d+)\s(?<Time>[\w:]+)\s(?<Message>.*)
Match 1
Month Apr
Date 1
Time 09:14:55
Message i have apple
Match 2
Month Apr
Date 2
Time 08:10:10
Message i have mango
https://rubular.com/r/1S4BcbDxPtlVxE
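A sketch of the same idea in pandas, with the named groups rewritten in Python's (?P<name>...) syntax and fed to str.extract (assuming the single column is named col, as in the first answer above):
df_named = df['col'].str.extract(
    r'(?P<Month>\w+)\s(?P<Date>\d+)\s(?P<Time>[\w:]+)\s(?P<Message>.*)')
print(df_named)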
I have a dataframe that contains a list of tuples in one of its columns. I need to split those tuples into corresponding columns. My dataframe df looks like this:
A B
[('Apple',50),('Orange',30),('banana',10)] Winter
[('Orange',69),('WaterMelon',50)] Summer
The expected output should be:
Fruit rate B
Apple 50 Winter
Orange 30 Winter
banana 10 Winter
Orange 69 Summer
WaterMelon 50 Summer
You can use the DataFrame constructor with numpy.repeat and numpy.concatenate:
import numpy as np

df1 = pd.DataFrame(np.concatenate(df.A), columns=['Fruit','rate']).reset_index(drop=True)
df1['B'] = np.repeat(df.B.values, df['A'].str.len())
print (df1)
Fruit rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
Another solution with chain.from_iterable:
from itertools import chain
df1 = pd.DataFrame(list(chain.from_iterable(df.A)),
                   columns=['Fruit','rate']).reset_index(drop=True)
df1['B'] = np.repeat(df.B.values, df['A'].str.len())
print (df1)
Fruit rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
This should work:
fruits = []
rates = []
seasons = []
def create_lists(row):
tuples = row['A']
season = row['B']
for t in tuples:
fruits.append(t[0])
rates.append(t[1])
seasons.append(season)
df.apply(create_lists, axis=1)
new_df = pd.DataFrame({"Fruit" :fruits, "Rate": rates, "B": seasons})[["Fruit", "Rate", "B"]]
output:
Fruit Rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
You can do this in a chained operation:
(
df.apply(lambda x: [[k,v,x.B] for k,v in x.A],axis=1)
.apply(pd.Series)
.stack()
.apply(pd.Series)
.reset_index(drop=True)
.rename(columns={0:'Fruit',1:'rate',2:'B'})
)
Out[1036]:
Fruit rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
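With pandas 0.25 or newer, DataFrame.explode offers a shorter route; a sketch (not part of the answers above), starting from the same df:
# Give each tuple its own row, then split the tuples into two columns
df1 = df.explode('A').reset_index(drop=True)
df1[['Fruit', 'rate']] = pd.DataFrame(df1['A'].tolist(), index=df1.index)
df1 = df1[['Fruit', 'rate', 'B']]
print(df1)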