Getting a dataframe out of a list of dicts - python

import pandas as pd

list_sample = [{'name': 'A', 'fame': 0, 'data': {'date': ['2021-01-01', '2021-02-01', '2021-03-01'],
                                                 'credit_score': [800, 890, 895],
                                                 'spend': [1500, 25000, 2400],
                                                 'average_spend': 5000}},
               {'name': 'B', 'fame': 1, 'data': {'date': ['2022-01-01', '2022-02-01', '2022-03-01'],
                                                 'credit_score': [2800, 390, 8900],
                                                 'spend': [15000, 5000, 400],
                                                 'average_spend': 3000}}]

df = pd.DataFrame()
for row in list_sample:
    name = row['name']
    fame = row['fame']
    data = row['data']
    df_temp = pd.DataFrame(data)
    df_temp['name'] = name
    df_temp['fame'] = fame
    df = pd.concat([df, df_temp])
Above is how I am getting my dataframe. This is a dummy example; the issue is that when the list grows, and the number of entries in each data array grows, the above takes a lot of time. Maybe concat is the issue, or maybe something else. Is there a better way to do what I am doing above (better in terms of run time)?

One way of doing this is to flatten the nested data dictionary inside each list_sample entry. You can do this with json_normalize.
import pandas as pd

# json_normalize lives on the top-level pandas namespace since 1.0;
# the old `from pandas.io.json import json_normalize` import is deprecated
df = pd.DataFrame(list_sample)
df = pd.concat([df.drop(['data'], axis=1), pd.json_normalize(df['data'])], axis=1)

It looks like you don't care about normalizing the data column. If that's the case, you can just do df = pd.DataFrame(list_sample) to achieve the same result. I think you'd only need to do the kind of iterating you're doing if you wanted to normalize the data.
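For reference, a quick check of what that one-liner produces:

df = pd.DataFrame(list_sample)
print(df.columns.tolist())  # ['name', 'fame', 'data']
# each row keeps its full nested dict in the 'data' column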

Combine all dicts in list_sample to fit a dataframe structure and concat them at once:
df = pd.concat([pd.DataFrame(d['data'] | {'name': d['name'], 'fame': d['fame']})
                for d in list_sample])
print(df)
         date  credit_score  spend  average_spend name  fame
0  2021-01-01           800   1500           5000    A     0
1  2021-02-01           890  25000           5000    A     0
2  2021-03-01           895   2400           5000    A     0
0  2022-01-01          2800  15000           3000    B     1
1  2022-02-01           390   5000           3000    B     1
2  2022-03-01          8900    400           3000    B     1
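Note that the dict-union operator | requires Python 3.9+. On older versions, ** unpacking produces the same merged dict:

# equivalent merge for Python < 3.9
df = pd.concat([pd.DataFrame({**d['data'], 'name': d['name'], 'fame': d['fame']})
                for d in list_sample])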

Pandas create rows based on interval between two dates

I am trying to expand a dataframe containing a number of columns by creating rows based on the interval between two date columns.
For this I am currently using a method that basically creates a cartesian product; it works well on small datasets but is very inefficient on large ones.
This method will be used on a DataFrame of roughly 2 million rows by 50 columns spanning multiple years from min to max date. The resulting dataset will be about 3 million rows, so a more efficient approach is required.
I have not succeeded in finding a less resource-intensive alternative.
What would be the best approach for this?
My current method here:
from datetime import date

import pandas as pd

raw_data = {'id': ['aa0', 'aa1', 'aa2', 'aa3'],
            'number': [1, 2, 2, 1],
            'color': ['blue', 'red', 'yellow', 'green'],
            'date_start': [date(2022,1,1), date(2022,1,1), date(2022,1,7), date(2022,1,12)],
            'date_end': [date(2022,1,2), date(2022,1,4), date(2022,1,9), date(2022,1,14)]}
df = pd.DataFrame(raw_data)
This gives the following result:
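    id  number   color  date_start    date_end
0  aa0       1    blue  2022-01-01  2022-01-02
1  aa1       2     red  2022-01-01  2022-01-04
2  aa2       2  yellow  2022-01-07  2022-01-09
3  aa3       1   green  2022-01-12  2022-01-14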
Now to create a set containing all possible dates between the min and max date of the set:
df_d = pd.DataFrame({'date': pd.date_range(df['date_start'].min(), df['date_end'].max() + pd.Timedelta('1d'), freq='1d')})
This results in the expected frame containing all possible dates (15 rows, 2022-01-01 through 2022-01-15).
Finally, cross merge the original set with the date set and filter the resulting rows based on each row's start and end date:
df_total = pd.merge(df, df_d,how='cross')
df = df_total[(df_total['date_start']<df_total['date']) & (df_total['date_end']>=df_total['date']) ]
This leads to the final dataframe (the same result shown under the efficient solution below), which is exactly what is needed.
Efficient Solution
# convert to datetime64 so the .dt accessor works (the sample data holds datetime.date objects)
df[['date_start', 'date_end']] = df[['date_start', 'date_end']].apply(pd.to_datetime)

d = df['date_end'].sub(df['date_start']).dt.days
df1 = df.reindex(df.index.repeat(d))
i = df1.groupby(level=0).cumcount() + 1
df1['date'] = df1['date_start'] + pd.to_timedelta(i, unit='d')
How does it work?
Subtract start from end to get the number of days elapsed, then reindex the dataframe by repeating each index exactly that many times. Group df1 by index and use cumcount to build a sequential counter, turn that counter into a timedelta series, and add it to date_start to get the result.
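A quick way to inspect the intermediate steps on the sample data:

print(d.tolist())          # [1, 3, 2, 2]  days elapsed per row
print(df1.index.tolist())  # [0, 1, 1, 1, 2, 2, 3, 3]  each row repeated d times
print(i.tolist())          # [1, 1, 2, 3, 1, 2, 1, 2]  per-row day offset added to date_start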
Result
    id  number   color date_start   date_end       date
0  aa0       1    blue 2022-01-01 2022-01-02 2022-01-02
1  aa1       2     red 2022-01-01 2022-01-04 2022-01-02
1  aa1       2     red 2022-01-01 2022-01-04 2022-01-03
1  aa1       2     red 2022-01-01 2022-01-04 2022-01-04
2  aa2       2  yellow 2022-01-07 2022-01-09 2022-01-08
2  aa2       2  yellow 2022-01-07 2022-01-09 2022-01-09
3  aa3       1   green 2022-01-12 2022-01-14 2022-01-13
3  aa3       1   green 2022-01-12 2022-01-14 2022-01-14
I don't know if this is an improvement, but here a pd.date_range is only created from each row's start and end date; the resulting list gets exploded and joined back to the original df. Note that inclusive='right' excludes date_start itself, which matches the desired output. Since apply runs a Python-level function per row, this is likely slower than the reindex/repeat approach on large frames.
from datetime import date

import pandas as pd

raw_data = {'id': ['aa0', 'aa1', 'aa2', 'aa3'],
            'number': [1, 2, 2, 1],
            'color': ['blue', 'red', 'yellow', 'green'],
            'date_start': [date(2022,1,1), date(2022,1,1), date(2022,1,7), date(2022,1,12)],
            'date_end': [date(2022,1,2), date(2022,1,4), date(2022,1,9), date(2022,1,14)]}
df = pd.DataFrame(raw_data)
# inclusive='right' needs pandas >= 1.4 (older versions use closed='right')
s = df.apply(lambda x: pd.date_range(x['date_start'], x['date_end'],
                                     freq='1d', inclusive='right').date,
             axis=1).explode()
df.join(s.rename('date'))

Identify modified rows from updated Dataframe

I collect data and analyze it. Sometimes data collected yesterday or last week is missing a value and gets updated when the records become available at a later date, or a row value might change. In other words, a row value might be modified; see the sample dataframes:
First dataframe received
import pandas as pd

cars = {'Date': ['2020-09-11', '2020-10-11', '2021-01-12', '2020-01-03', '2021-02-01'],
        'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4', 'Mercedes'],
        'Price': [22000, 25000, 27000, 35000, 45000],
        'Mileage': [2000, 'NAN', 47000, 3500, 5000]}
df = pd.DataFrame(cars, columns=['Date', 'Brand', 'Price', 'Mileage'])
print(df)
Modification done on the first dataframe
import pandas as pd

cars2 = {'Date': ['2020-09-11', '2020-10-11', '2021-01-12', '2020-01-03', '2021-02-01'],
         'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4', 'Mercedes'],
         'Price': [22000, 5000, 27000, 35000, 45000],
         'Mileage': [2000, 100, 47000, 3500, 600]}
df2 = pd.DataFrame(cars2, columns=['Date', 'Brand', 'Price', 'Mileage'])
print(df2)
Now I'd like to know how I can select only the rows modified from the first dataframe. My expected output contains only the rows which were modified at a later date. I have tried this, but it gives me the old rows too:
df_diff = pd.concat([df,df2], sort=False).drop_duplicates(keep=False, inplace=False)
Expected output
import pandas as pd

cars3 = {'Date': ['2020-10-11', '2021-02-01'],
         'Brand': ['Toyota Corolla', 'Mercedes'],
         'Price': [5000, 45000],
         'Mileage': [100, 600]}
df3 = pd.DataFrame(cars3, columns=['Date', 'Brand', 'Price', 'Mileage'])
print(df3)
Because both dataframes share the same index and columns, it is possible to use DataFrame.ne to compare for inequality, test whether at least one value per row is True with DataFrame.any, and filter with boolean indexing:
df3 = df2[df.ne(df2).any(axis=1)]
print (df3)
Date Brand Price Mileage
1 2020-10-11 Toyota Corolla 5000 100
4 2021-02-01 Mercedes 45000 600
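On pandas 1.1+, DataFrame.compare is a related alternative (a sketch, not part of the original answer): it shows which cells changed and their old/new values rather than returning whole rows.

# each changed cell appears with its 'self' (df) and 'other' (df2) value
print(df.compare(df2))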

How to iterate through selected rows in pandas dataframe with conditions matching three rows?

If I have a sample dataframe like this:
>>> import pandas as pd
>>> a = [100,300,200,100,700,600,400,600]
>>> i = ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007"]
>>> df = pd.DataFrame(a, index = i, columns = ["gdp"])
>>> df
>>> df
gdp
2000 100
2001 300
2002 200
2003 100
2004 700
2005 600
2006 400
2007 600
res=[]
I want to iterate through the rows, and the conditions are:
if row(x+1) - row(x) < 0 and row(x+2) - row(x+1) < 0:
    res.append(index[x])
So in this example, I would get a list of [2001, 2004]
I'm not sure how to write the code for this. Thank you!
I prefer a non-loop solution because of the better performance: use Series.shift, subtract with Series.sub, and compare with Series.lt for less-than; finally, filter by boolean indexing with DataFrame.loc (if needed, also filter by the column name gdp):
s1 = df['gdp'].shift(-1)
s2 = df['gdp'].shift(-2)
m = s1.sub(df['gdp']).lt(0) & s2.sub(s1).lt(0)
out = df.loc[m, 'gdp']
print (out)
2001 300
2004 700
Name: gdp, dtype: int64
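To get the plain list of index labels the question asked for:

res = df.index[m].tolist()
print(res)  # ['2001', '2004']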

Have a pandas dataframe with columns price, soldPrice, ProductId, and typeTransaction (sell/return). How can I calculate profit taking returns into account?

I want to calculate profit but I cannot work out how to do it. I tried to group by two columns, but how do I subtract one from the other?
df[df.groupby(['productId','typeTransaction']).size().reset_index()]
Try this:
import pandas as pd
import numpy as np

data = {
    'price': [1200, 1500, 2000, 3000],
    'soldPrice': [1800, 2500, 2300, 5000],
    'typeTransaction': ['sell', 'return', 'return', 'sell']
}
df = pd.DataFrame(data)
# profit only on sells; returns contribute 0
df['profit'] = np.where(df['typeTransaction'] == 'sell', df['soldPrice'] - df['price'], 0)
print(df)
Output:
price soldPrice typeTransaction profit
0 1200 1800 sell 600
1 1500 2500 return 0
2 2000 2300 return 0
3 3000 5000 sell 2000
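The question also mentions a ProductId column, which the sample data above omits. If such a column exists, a minimal sketch of netting the profit per product (the column name here is an assumption):

# hypothetical: assumes df also carries a 'productId' column
net_profit = df.groupby('productId')['profit'].sum()
print(net_profit)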

Return Sum for Single Day using Pandas

I am working with a CSV file and I need to return the sum of data for a specific day. Thus far I have been able to break the code into this:
import pandas as pd

df = pd.read_csv(r'C:\Users\ern\Downloads\evergreen.csv')
sum_imps = df['money'].sum()  # sum the total of the money column
sumimps_byday = df.groupby(['day'])[['money']].sum()  # groups the sum of the money column by day
Now all I need is to take it one step further and return the sum of money for a specific day of my choosing. I don't think this is too hard, I'm just drawing a blank.
total_money = df.loc[df['day'] == '20/03/2019', 'money'].sum()
should do the trick.
For example,
import pandas as pd

df = pd.DataFrame({'day': ['20/03/2019', '21/03/2019', '20/03/2019'],
                   'money': [1, 5, 7]})
print(df)
print('Total money for 20/03/2019: ' + str(df.loc[df['day'] == '20/03/2019', 'money'].sum()))
should give the desired output
day money
0 20/03/2019 1
1 21/03/2019 5
2 20/03/2019 7
Total money for 20/03/2019: 8
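Since the question already computes per-day sums with groupby, another option (assuming the day column holds strings like '20/03/2019') is to index into that grouped result:

# look up a single day in the per-day sums from the question
print(sumimps_byday.loc['20/03/2019', 'money'])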
Let us define df as follows:
import pandas as pd

df = pd.DataFrame(data=[['day1', 2900, 3000],
                        ['day2', 3300, 3350],
                        ['day3', 3200, 3150],
                        ['day1', 3200, 3050]],
                  columns=['day', 'money', 'close'])
Here is what df looks like.
df
>>>
day money close
0 day1 2900 3000
1 day2 3300 3350
2 day3 3200 3150
3 day1 3200 3050
Suppose I want to get the sum of money for day1, which we can easily see to be 6100 (2900 + 3200). I would do the following.
df[df.day == 'day1']['money'].sum()
>>>6100
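The same lookup via a groupby, for comparison:

df.groupby('day')['money'].sum()['day1']
>>>6100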
