I'm parsing data in a loop and once it's parsed and structured I would like to then add it to a data frame.
The end format of the data frame I would like is something like the following:
df:
id 2018-01 2018-02 2018-03
234 2 1 3
345 4 5 1
534 5 3 4
234 2 2 3
When I iterate through the data in the loop I have a dictionary with the id, the month and the value for the month, for example:
{'id':234,'2018-01':2}
{'id':534,'2018-01':5}
{'id':534,'2018-03':4}
.
.
.
What is the best way to take an empty data frame and add rows and columns with their values to it in a loop?
Essentially as I iterate it would look something like this
df:
id 2018-01
234 2
then
df:
id 2018-01
234 2
534 5
then
df:
id 2018-01 2018-03
234 2
534 5 4
and so on...
IIUC, you need to convert each single dict to a DataFrame first and then append it. Because the same 'id' can show up more than once, we then group by the index and take the first non-null value in each column:
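import pandas as pd

df=pd.DataFrame()
l=[{'id':234,'2018-01':2},
   {'id':534,'2018-01':5},
   {'id':534,'2018-03':4}]
for x in l:
    row=pd.Series(x).to_frame().T.set_index('id')
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df=pd.concat([df,row]).groupby(level=0).first().rename_axis('id')
    print(df)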
2018-01
id
234 2
2018-01
id
234 2
534 5
2018-01 2018-03
id
234 2.0 NaN
534 5.0 4.0
It is not advisable to generate a new data frame at each iteration and append it; this is quite expensive. If your data is not too big and fits into memory, you can make a list of dictionaries first, and then pandas allows you to simply do:
df = pd.DataFrame(your_list_of_dicts)
df = df.set_index('id')  # set_index returns a new frame, so assign it back
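Applied to the dictionaries from the question, this might look like the sketch below. The groupby step is an addition, not part of the two lines above: it is one way to collapse repeated ids into a single row, on the assumption that each (id, month) pair appears at most once.

```python
import pandas as pd

# The per-iteration dictionaries from the question, collected in a list
records = [{'id': 234, '2018-01': 2},
           {'id': 534, '2018-01': 5},
           {'id': 534, '2018-03': 4}]

df = pd.DataFrame(records)

# Collapse repeated ids into one row, keeping the first
# non-null value found in each month column
df = df.groupby('id').first()
print(df)
```

Months that never occur for an id stay NaN, which is why the month columns come out as floats.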
If making a list is too expensive (because you'd like to save memory for the data frame), consider using a generator instead of a list. The basic anatomy of a generator function is this:
def datagen(your_input):
    for item in your_input:
        record = ...  # your code to make a dict from item
        yield record
The generator object data = datagen(input) will not store the dicts but yields a dict at each iteration. It can generate items on demand. When you do pd.DataFrame(data), pandas will stream all the data and make a data frame. Generators can be used for data pipelines (like pipes in UNIX) and are very powerful for big data workflows. Be aware, however, that a generator object can be consumed only once, that is if you run pd.DataFrame(data) again, you will get an empty data frame.
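A small sketch of that behaviour, assuming the raw input is an iterable of (id, month, value) tuples (a made-up input shape used only for illustration):

```python
import pandas as pd

def datagen(raw_rows):
    # raw_rows is a hypothetical iterable of (id, month, value) tuples
    for row_id, month, value in raw_rows:
        yield {'id': row_id, month: value}

raw = [(234, '2018-01', 2), (534, '2018-01', 5), (534, '2018-03', 4)]

data = datagen(raw)
df = pd.DataFrame(data)        # consumes the generator
df_again = pd.DataFrame(data)  # the generator is exhausted, so this is empty

print(len(df), df_again.empty)
```

As far as I can tell, pandas collects the generator's items internally before building the frame, so the saving is in not keeping your own list around rather than in the construction step itself.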
The easiest way I've found in Pandas (although not intuitive) to iteratively append new data rows to a dataframe is using df.loc[ ] to reference the last (nonexistent) row, with len(df) as the index:
df.loc[ len(df) ] = [new, row, of, data]
This will "append" the new data row to the end of the dataframe in-place.
The above example is for an empty Dataframe with exactly 4 columns, such as:
df = pandas.DataFrame( columns=["col1", "col2", "col3", "col4"] )
df.loc[ ] indexing can insert data at any row label at all, whether or not it exists yet. Assigning to a label that does not exist adds a new row rather than raising an IndexError, as a numpy.array or list would if you tried to assign to a nonexistent position.
For a brand-new, empty DataFrame, len(df) returns 0, and thus references the first, blank row, and then increases by one each time you add a row.
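Put together, a minimal sketch of the pattern (the row values are placeholders; this assumes the frame keeps its default 0, 1, 2, ... index, otherwise len(df) may collide with an existing label and overwrite that row):

```python
import pandas as pd

df = pd.DataFrame(columns=["col1", "col2", "col3", "col4"])

# len(df) is 0 here, so this creates the row labelled 0
df.loc[len(df)] = [1, 2, 3, 4]

# len(df) is now 1, so this creates the row labelled 1
df.loc[len(df)] = [5, 6, 7, 8]

print(df)
```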
–––––
I do not know the speed/memory efficiency cost of this method, but it works great for my modest datasets (a few thousand rows). At least from a memory perspective, I imagine that a large loop appending data to the target DataFrame directly would use less memory than generating an intermediate list of duplicate data first, then generating a DataFrame from that list. Time "efficiency" could be a different question entirely, one for the other SO gurus to comment on.
–––––
However, for the OP's specific case, where you also asked to merge data into an existing, identically-named column, you'd need some logic inside your for loop.
Instead I would make the DataFrame "dumb" and just import the data as-is, repeating dates as they come, e.g. your post-loop DataFrame would look like this, with simple column names describing the raw data:
df:
id date data
234 2018-01 2
534 2018-01 5
534 2018-03 4
(has two entries for the same date).
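Then I would use the DataFrame's reshaping functions to organize this data how you like, probably using some combination of df.sort_values() and df.pivot_table() (DataFrame itself has no unique() method, and the old df.sort() has been replaced by df.sort_values()). Will look into that more later.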
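One way that reshaping step could go, as a sketch rather than a tested recipe: load everything into the "dumb" id/date/data layout, then let pivot_table build the wide table. The aggfunc='first' choice is an assumption that each (id, date) pair should contribute a single value.

```python
import pandas as pd

parsed = [{'id': 234, '2018-01': 2},
          {'id': 534, '2018-01': 5},
          {'id': 534, '2018-03': 4}]

# Build the long table: one row per (id, date, data) triple
rows = []
for record in parsed:
    rec_id = record['id']
    for key, value in record.items():
        if key != 'id':
            rows.append({'id': rec_id, 'date': key, 'data': value})
long_df = pd.DataFrame(rows)

# Reshape to one row per id and one column per date
wide = long_df.pivot_table(index='id', columns='date',
                           values='data', aggfunc='first')
print(wide)
```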
I have big data (30 million rows).
Each table has id, date, value.
I need to go over each id and, for each id, get a list of values sorted by date, so that the first value in the list is the one with the oldest date.
Example:
ID DATE VALUE
1 02/03/2020 300
1 04/03/2020 200
2 04/03/2020 456
2 01/03/2020 300
2 05/03/2020 78
Desired table:
ID VALUE_LIST_ORDERED
1 [300,200]
2 [300,456,78]
I can do it with a for loop or with apply, but it's not efficient, and with millions of users it's not feasible.
I thought about using groupby and sorting by date, but I don't know how to make a list. And if so, is groupby on a pandas df the best way?
I would love to get some suggestions on how to do it and which kind of df/technology to use.
Thank you!
What you need to do is sort your data using pandas.DataFrame.sort_values and then apply the groupby method to the sorted frame; groupby keeps the rows of each group in the order they had in the frame.
I don't have a huge data set to test this code on, but I believe this would do the trick:
sorted_data = data.sort_values('DATE')
result = sorted_data.groupby('ID').VALUE.apply(np.array)
And since it's Python, you can always put everything in one statement:
print(data.sort_values('DATE').groupby('ID').VALUE.apply(np.array))
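A runnable version on the sample data from the question might look like this. Two things here are assumptions on my part: the dates are read as day/month/year, and DATE is converted with pd.to_datetime first, since sorting the raw strings would order them alphabetically rather than chronologically.

```python
import pandas as pd

data = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2],
    'DATE': ['02/03/2020', '04/03/2020', '04/03/2020',
             '01/03/2020', '05/03/2020'],
    'VALUE': [300, 200, 456, 300, 78],
})

# Assumed day/month/year; adjust the format string if the data differs
data['DATE'] = pd.to_datetime(data['DATE'], format='%d/%m/%Y')

result = (data.sort_values(['ID', 'DATE'])
              .groupby('ID')['VALUE']
              .agg(list)
              .rename('VALUE_LIST_ORDERED'))
print(result)
```

Using agg(list) gives plain Python lists per ID; apply(np.array) as above works the same way if arrays are preferred.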
I am trying to pivot a pandas dataframe, but the data is following a strange format that I cannot seem to pivot. The data is structured as below:
Date, Location, Action1, Quantity1, Action2, Quantity2, ... ActionN, QuantityN
<date> 1 Lights 10 CFloor 1 ... Null Null
<date2> 2 CFloor 2 CWalls 4 ... CBasement 15
<date3> 2 CWalls 7 CBasement 4 ... Null Null
Essentially, each action will always have a quantity attached to it (which may be 0), but null actions will never have a quantity (the quantity will just be null). The format I am trying to achieve is the following:
Lights CFloor CBasement CWalls
1 10 1 0 0
2 0 2 19 11
The index of the rows becomes the location, while the columns become any unique action found across the multiple activity columns. When pulling the data together, the value of each row/column is the sum of each quantity associated with the action (i.e. Action1 corresponds to Quantity1). Is there a way to do this with the native pandas pivot function?
My current code performs a ravel across all the activity columns to get a list of all unique activities. It will also grab all the unique locations from the Location column. Once I have the unique columns, I create an empty dataframe and fill it with zeros:
Lights CFloor CBasement CWalls
1 0 0 0 0
2 0 0 0 0
I then iterate back over the old data frame with the itertuples() method (I was told it was significantly faster than iterrows()) and populate the new dataframe. This empty dataframe acts as a template that is stored in memory and filled later.
#Creates a template from the dataframe
def create_template(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    activities = df[act_cols]
    flat_acts = activities.values.ravel('K')
    unique_locations = pd.unique(df['Location'])
    unique_acts = pd.unique(flat_acts)
    pivot_template = pd.DataFrame(index=unique_locations, columns=unique_acts).fillna(0)
    return pivot_template

#Fills the template from the dataframe
def create_pivot(df, pivot_frmt):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    for row in df.itertuples():
        for act, quantity in zip(act_cols, quant_cols):
            act_val = getattr(row, act)
            if pd.notna(act_val):
                quantity_val = getattr(row, quantity)
                location = getattr(row, 'Location')
                pivot_frmt.loc[location, act_val] += quantity_val
    return pivot_frmt
While my solution works, it is incredibly slow when dealing with a large dataset and has taken 10 seconds or more to complete this type of operation. Any help would be greatly appreciated!
After experimenting with various pandas functions, such as melt and pivot on multiple columns simultaneously, I found a solution that worked for me:
For every quantity-activity pair, I build a partial frame of the final dataset and store it in a list. Once every pair has been addressed I will end up with multiple dataframes that all have the same row counts, but potentially different column counts. I solved this issue by simply concatenating the columns and if any columns are repeated, I then sum them to get the final result.
def test_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    dfs = []
    for act, quant in zip(act_cols, quant_cols):
        # One Location x Activity table per activity/quantity pair
        partial = pd.crosstab(index=df['Location'], columns=df[act], values=df[quant], aggfunc=np.sum).fillna(0)
        dfs.append(partial)
    finalDf = pd.concat(dfs, axis=1)
    # Sum columns that share an activity name; transposing avoids
    # groupby(axis=1), which newer pandas versions deprecate
    finalDf = finalDf.T.groupby(level=0).sum().T
    return finalDf
There are two assumptions that I make during this approach:
The indexes maintain their order across all partial dataframes
There are an equivalent number of indexes across all partial dataframes
While this is probably not the most elegant solution, it achieves the desired result and reduced the time it took to process the data by a very significant margin (from 10s ~4k rows to 0.2s ~4k rows). If anybody has a better way to deal with this type of scenario and do the process outlined above in one shot, then I would love to see your response!
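As one possible take on the "one shot" question above, here is a sketch that stacks every activity/quantity pair into a single long table and pivots once. The helper name pivot_actions and the sample frame are made up for illustration; the sample values mirror the example in the question.

```python
import pandas as pd

def pivot_actions(df, act_cols, quant_cols):
    # Stack every (activity, quantity) column pair into one long frame
    pairs = [
        df[['Location', act, quant]].rename(
            columns={act: 'Action', quant: 'Quantity'})
        for act, quant in zip(act_cols, quant_cols)
    ]
    long_df = pd.concat(pairs, ignore_index=True)
    # Null actions carry no quantity, so drop them
    long_df = long_df.dropna(subset=['Action'])
    # A single pivot sums quantities per location and action
    return long_df.pivot_table(index='Location', columns='Action',
                               values='Quantity', aggfunc='sum',
                               fill_value=0)

nan = float('nan')
sample = pd.DataFrame({
    'Location':   [1, 2, 2],
    'Activity01': ['Lights', 'CFloor', 'CWalls'],
    'Quantity01': [10, 2, 7],
    'Activity02': ['CFloor', 'CWalls', 'CBasement'],
    'Quantity02': [1, 4, 4],
    'Activity03': [None, 'CBasement', None],
    'Quantity03': [nan, 15, nan],
})

result = pivot_actions(sample,
                       ['Activity01', 'Activity02', 'Activity03'],
                       ['Quantity01', 'Quantity02', 'Quantity03'])
print(result)
```

Because everything ends up in one long table before pivoting, this version does not rely on the partial frames sharing the same index order. I have not timed it against the crosstab approach.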
Hello, I want to store a dataframe in a cell of another dataframe.
I have a data that looks like this
I have daily data which consists of date, steps, and calories. In addition, I have minute-by-minute HR data for a specific date. Obviously it would be easy to put the minute-by-minute data in a 2-dimensional list, but I fear that would be harder to analyze later.
What would be the best practice when I want to have both kinds of data in one dataframe? Is it even possible to nest dataframes?
Any better ideas? Thanks!
Yes, it seems possible to nest dataframes but I would recommend instead rethinking how you want to structure your data, which depends on your application or the analyses you want to run on it after.
How to "nest" dataframes into another dataframe
Your dataframe containing your nested "sub-dataframes" won't be displayed very nicely. However, just to show that it is possible to nest your dataframes, take a look at this mini-example:
Here we have 3 random dataframes:
>>> df1
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
>>> df2
0 1 2
0 0.090917 0.457668 0.598548
1 0.748639 0.729935 0.680409
2 0.301244 0.024004 0.361283
>>> df3
0 1 2
0 0.200375 0.059798 0.665323
1 0.086708 0.320635 0.594862
2 0.299289 0.014134 0.085295
We can make a main dataframe that includes these dataframes as values in individual "cells":
df = pd.DataFrame({'idx':[1,2,3], 'dfs':[df1, df2, df3]})
We can then access these nested dataframes as we would access any value in any other dataframe:
>>> df['dfs'].iloc[0]
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
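For the restructuring suggested at the top of this answer, one option (a sketch only; the column names and numbers below are invented sample values, not the asker's data) is to keep two flat tables that share a date column and join them when needed:

```python
import pandas as pd

# Daily summary: one row per date (made-up sample values)
daily = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-02'],
    'steps': [8000, 10500],
    'calories': [2100, 2400],
})

# Minute-by-minute heart rate: many rows per date (made-up sample values)
hr = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-01', '2020-01-02'],
    'minute': [0, 1, 0],
    'hr': [61, 63, 58],
})

# Attach the daily figures to every minute row when an analysis needs both
merged = hr.merge(daily, on='date', how='left')

# Or summarise the minute data per day instead
mean_hr = hr.groupby('date')['hr'].mean()
print(merged)
print(mean_hr)
```

Both tables stay ordinary dataframes, so groupby, merge and plotting work on them directly, which is harder with frames stored inside cells.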
I'm currently working on a big chunk of data, clustering within each RFM_class. RFM_class has 125 distinct values ranging from 111 to 555; my dataframe is currently sampled down to 10000 rows for trial runs of the script.
The logic behind what I'm trying to do is: take each RFM_class (125 distinct values), run a clustering method on the subset of rows for that RFM_class to get a cluster_class column, and append the result to an initially empty dataframe. That dataframe will then be merged into my main table.
This is a snapshot of the main table. I shrank it to 4 columns only; the original has 11 columns.
df_test
RFM_class customer_id num_orders recent_day amount_order
555 1 1489 0 18539000
555 2 72 3 1069000
145 3 13 591 1350000
555 4 208 0 2119000
445 5 40 9 698000
I haven't gotten as far as the clustering yet; I'm stuck on looping over each RFM_class. This is what I've been trying for the last couple of days, just to select the rows for each RFM_class:
rfm_list = list(set(df_test['rfm']))
core_col = ['num_orders','recent_day','amount_order']
cl_class = []
for row in rfm_list:
    a=pd.DataFrame(df_test[core_col][df_test.rfm==row],columns=core_col)
    cl_class.append(a)
cl_class
But the result is not as expected, because append does not seem to add new rows to my empty dataframe.
Is there a function to do this in pandas? I'm currently using Python 3.0.
You can use groupby to split the rows into groups by value.
For example, consider this CSV file, where you would like to group by the column Fruit:
Fruit,Date,Name,Number
Apples,10/6/2016,Bob,7
Apples,10/6/2016,Bob,8
Apples,10/6/2016,Mike,9
Apples,10/7/2016,Steve,10
Apples,10/7/2016,Bob,1
Oranges,10/7/2016,Bob,2
Oranges,10/6/2016,Tom,15
Oranges,10/6/2016,Mike,57
Oranges,10/6/2016,Bob,65
Oranges,10/7/2016,Tony,1
Grapes,10/7/2016,Bob,1
Grapes,10/7/2016,Tom,87
Grapes,10/7/2016,Bob,22
Grapes,10/7/2016,Bob,12
Grapes,10/7/2016,Tony,15
The sample code to iterate through clusters:
import pandas as pd
df = pd.read_csv("filename.csv")
grouped = df.groupby("Fruit")
for name, group in grouped:
    print(name)
Hope this helps.
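Carried over to the RFM table from the question, the same loop could collect one labelled piece per class and concatenate them at the end. This is a sketch: assign_clusters is a hypothetical stand-in for whatever clustering method you actually use (here it just flags rows above the group's median amount_order), and the grouping column is assumed to be named RFM_class as in the table shown.

```python
import pandas as pd

df_test = pd.DataFrame({
    'RFM_class': [555, 555, 145, 555, 445],
    'customer_id': [1, 2, 3, 4, 5],
    'num_orders': [1489, 72, 13, 208, 40],
    'recent_day': [0, 3, 591, 0, 9],
    'amount_order': [18539000, 1069000, 1350000, 2119000, 698000],
})
core_col = ['num_orders', 'recent_day', 'amount_order']

def assign_clusters(group):
    # Hypothetical placeholder for a real clustering step:
    # 1 if amount_order is above the group's median, else 0
    median = group['amount_order'].median()
    return (group['amount_order'] > median).astype(int)

parts = []
for rfm_class, group in df_test.groupby('RFM_class'):
    labelled = group.copy()
    labelled['cluster_class'] = assign_clusters(group[core_col])
    parts.append(labelled)

# One concat at the end instead of appending inside the loop
result = pd.concat(parts).sort_index()
print(result)
```

Since each piece keeps its original row index, result lines up with df_test without a separate merge.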
I have the following pandas DataFrame, df.head():
userid followers experience fixed_date
0 12134 28266 Intermediate 2012-10-15
1 12134 28266 Intermediate 2012-10-15
2 91638 665 Missing 2012-10-15
3 148401 123 Professional 2012-10-15
4 5890 2436 Professional 2012-10-15
I'd like to make a new DataFrame where the rows are userid, the columns are fixed_date, and the values are a tuple of (followers, experience). As you can see, I have duplicate userid rows, which cause the error I get when I try df.pivot(). But the number of followers can change at a later date, so I'd like to capture that for each userid.
I can give a little more background about the data. The rows are currently tweets, so a user can tweet (and often does) more than once in a given day. Therefore there will also be duplicate fixed_date values, because I disregard the exact time of the tweet (HH:MM:SS). In cases where the user tweeted multiple times on a given date, it would be great to group this into the cell value and make an array of tuples. If this is already asking for too much, it'd be more than okay to just have multiple columns of the same value. If that's not possible, I can also save the dates to a separate array and enumerate the columns 0..n. Just throwing out thoughts.
Any ideas?
It's not elegant but this will work:
df2 = df.loc[:, ['userid', 'followers', 'fixed_date']] # New frame with just the needed cols
df2 = df2.drop_duplicates() # remove duplicate records while userid is still a column (drop_duplicates ignores the index)
df2 = df2.set_index('userid') # Set the index to user id
Or if you just want the tuples to pass to an array you could do this:
df.loc[:, ['userid', 'fixed_date', 'followers']].values
# array([[12134, '2012-10-15', 28266],
[12134, '2012-10-15', 28266],
[91638, '2012-10-15', 665],
[148401, '2012-10-15', 123],
[5890, '2012-10-15', 2436]], dtype=object)
The result is already a numpy ndarray (of dtype object), which you could use to build a sparse matrix or convert to another array type.
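If you do want the userid by fixed_date layout with a list of (followers, experience) tuples in each cell, as described in the question, a groupby followed by unstack is one way to sketch it. The helper column name pair is my own; the sample rows are the ones from df.head().

```python
import pandas as pd

df = pd.DataFrame({
    'userid': [12134, 12134, 91638, 148401, 5890],
    'followers': [28266, 28266, 665, 123, 2436],
    'experience': ['Intermediate', 'Intermediate', 'Missing',
                   'Professional', 'Professional'],
    'fixed_date': ['2012-10-15'] * 5,
})

# Combine the two value columns into one tuple per tweet
df['pair'] = list(zip(df['followers'], df['experience']))

# Collect every tuple for a (userid, fixed_date) pair into a list,
# then move the dates into the columns
table = (df.groupby(['userid', 'fixed_date'])['pair']
           .agg(list)
           .unstack('fixed_date'))
print(table)
```

Cells holding lists are awkward to compute with later, so this is mostly useful for inspection or export.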