I have a large set of data that I have processed to generate a dictionary. Now I want to create a dataframe from this dictionary. Values of the dictionary are lists of tuples. From those values I need to find the unique values that will become the columns of the dataframe:
d = {'0001': [('skiing', 0.789), ('snow', 0.65), ('winter', 0.56)],
     '0002': [('drama', 0.89), ('comedy', 0.678), ('action', -0.42), ('winter', -0.12), ('kids', 0.12)],
     '0003': [('action', 0.89), ('funny', 0.58), ('sports', 0.12)],
     '0004': [('dark', 0.89), ('Mystery', 0.678), ('crime', 0.12), ('adult', -0.423)],
     '0005': [('cartoon', -0.89), ('comedy', 0.678), ('action', 0.12)],
     '0006': [('drama', -0.49), ('funny', 0.378), ('Suspense', 0.12), ('Thriller', 0.78)],
     '0007': [('dark', 0.79), ('Mystery', 0.88), ('crime', 0.32), ('adult', -0.423)]}
(the dictionary holds close to 800,000 records)
I iterate over the dictionary to find out the unique headers:
col_headers = []
entities = []
for key, scores in d.items():
    entities.append(key)
    d[key] = dict(scores)              # convert each list of tuples to a dict
    col_headers.extend(d[key].keys())
col_headers = list(set(col_headers))
I believe this takes a long time to process. Using dict might also be an issue since it is much slower. Furthermore, constructing the data frame row by row slows the process down even more:
df = pd.DataFrame(columns=col_headers, index=entities)
for k in d:
    df.loc[k] = pd.Series(d[k])
df.fillna(0.0, axis=1)
How can I speed this up to reduce the processing time?
@ajcr almost gets it.
But you probably also need to unwrap the internal key-value pairs into a dictionary along the way.
df = pd.DataFrame.from_dict({k: dict(v) for k, v in d.items()},
                            orient="index").fillna(0)
Then optionally, if you want to homogenize the style of column titles:
df.columns = [c.lower() for c in df.columns]
If you wanted to go entirely crazy, you could then sort the columns:
df = df.sort_index(axis=1)
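For reference, here is a minimal end-to-end sketch of the above on a slice of the sample dictionary (assuming pandas is imported as pd; only the question's d is real data, the rest is illustrative):
import pandas as pd

# first two entries of the sample dictionary from the question
d = {'0001': [('skiing', 0.789), ('snow', 0.65), ('winter', 0.56)],
     '0002': [('drama', 0.89), ('comedy', 0.678), ('action', -0.42), ('winter', -0.12), ('kids', 0.12)]}

df = pd.DataFrame.from_dict({k: dict(v) for k, v in d.items()}, orient="index").fillna(0)
df.columns = [c.lower() for c in df.columns]
df = df.sort_index(axis=1)
# df now has one row per key ('0001', '0002') and one column per unique tag,
# with 0.0 where a key had no score for that tag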
Related
I have multiple csv files
I was able to load them as data frames into a dictionary by using keywords:
# reading files into dataframes
csvDict = {}
for index, rows in keywords.iterrows():
    eachKey = rows.Keyword
    csvFile = "SRT21" + eachKey + ".csv"
    csvDict[eachKey] = pd.read_csv(csvFile)
Now I have other functions to apply to a specific column of every data frame.
On a single data frame the code would look like this:
df['Cat_Frames'] = df['Cat_Frames'].apply(process)
df['Cat_Frames'] = df['Cat_Frames'].apply(cleandata)
df['Cat_Frames'] = df['Cat_Frames'].fillna(' ')
My question is how to loop through every data frame in the dictionary to apply those function?
I have tried
for item in csvDict.items():
    df = pd.DataFrame(item)
    df
and it gives me an empty result.
any solution or suggestion?
You can chain the applys like this:
for key, df in csvDict.items():
    df['Cat_Frames'] = df['Cat_Frames'].apply(process).apply(cleandata).fillna(' ')
items() returns a tuple of key/value, so you should make your for loop actually say:
for key, value in csvDict.items():
    df = pd.DataFrame(value)
    df
You also need to print the df if you aren't in Jupyter:
for key, value in csvDict.items():
    df = pd.DataFrame(value)
    print(df)
I think this is how you should traverse the dictionary.
When there is no processing of data from one data set/frame involving another data set, don't collect data sets.
Just process the current one, and proceed to the next.
The conventional name for a variable receiving an unpacked value that is not going to be used is _:
for _, df in csvDict.items():
    df['Cat_Frames'] = df['Cat_Frames'].apply(…
But why ask for the keys just to ignore them? Iterate over the values instead:
for df in csvDict.values():
    df['Cat_Frames'] = df['Cat_Frames'].apply(…
I have the below dataframe:
And I have the below dictionary:
resource_ids_dict = {'Austria':1586023272, 'Bulgaria':1550004006, 'Croatia':1131119835, 'Denmark':1703440195,
'Finland':2005848983, 'France':1264698819, 'Germany':1907737079, 'Greece':2113941104,
'Italy':27898245, 'Netherlands':1832579427, 'Norway':1054291604, 'Poland':1188865122,
'Romania':270819662, 'Russia':2132391298, 'Serbia':1155274960, 'South Africa':635838568,
'Spain':52600180, 'Switzerland':842323896, 'Turkey':1716131192, 'UK':199152257}
I am using the above dictionary values to make calls to a vendor API. I then append all the returned data into a dataframe df.
What I would like to do now is add a column after ID containing the dictionary keys that correspond to the dictionary values found in ResourceSetID.
I have had a look on the web, but haven't managed to find anything (probably due to my lack of accurate keyword searches). Surely this should be a one-liner? I want to avoid looping through the dataframe and the dictionary and mapping that way.
Use Series.map, but first it is necessary to swap the values and keys in the dictionary:
d = {v:k for k, v in resource_ids_dict.items()}
#alternative
#d = dict(zip(resource_ids_dict.values(), resource_ids_dict.keys()))
df['new'] = df['ResourceSetID'].map(d)
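As a rough illustration on made-up data (only the ResourceSetID values come from the dictionary above; the rest of the frame is hypothetical):
import pandas as pd

resource_ids_dict = {'Austria': 1586023272, 'Bulgaria': 1550004006, 'UK': 199152257}  # slice of the full dict

df = pd.DataFrame({'ID': [1, 2, 3],
                   'ResourceSetID': [1586023272, 1550004006, 199152257]})  # hypothetical frame

d = {v: k for k, v in resource_ids_dict.items()}  # invert: id -> country
df['new'] = df['ResourceSetID'].map(d)            # 'Austria', 'Bulgaria', 'UK'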
I'm wanting to aggregate some API responses into a DataFrame.
The request consistently returns a number of JSON key/value pairs, let's say A, B, C. Occasionally, however, it will return A, B, C, D.
I would like something comparable to SQL's OUTER JOIN that will simply add the new rows, while filling the missing columns with NULL or some other placeholder.
The pandas join options insist on imposing a unique suffix for each side, which I really don't want.
Am I looking at this the wrong way?
If there is no easy solution, I could just select a subset of the consistently available columns but I really wanted to download the lot and do the processing as a separate stage.
You can use pandas.concat, as it provides all the functionality required for your problem. Let this toy example illustrate a possible solution.
import numpy as np
import pandas as pd

# This generates random data with some key and value pairs.
def gen_data(_size):
    import string
    keys = list(string.ascii_uppercase)
    return dict((k, [v]) for k, v in zip(np.random.choice(keys, _size),
                                         np.random.randint(1000, size=_size)))

counter = 0
df = pd.DataFrame()
while True:
    if counter > 5:
        break
    # Receive the data
    new_data = gen_data(5)
    # Convert it to a dataframe object
    new_data = pd.DataFrame(new_data)
    # Append this data to the accumulated frame; sort=True aligns the columns
    df = pd.concat((df, new_data), axis=0, sort=True)
    counter += 1

df.reset_index(drop=True, inplace=True)
print(df.to_string())
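For a more literal toy version of the A,B,C versus A,B,C,D situation from the question (the column names come from there, the values are made up), concat aligns the columns and leaves NaN wherever a response lacked a key:
import pandas as pd

first = pd.DataFrame([{'A': 1, 'B': 2, 'C': 3}])            # the usual response
second = pd.DataFrame([{'A': 4, 'B': 5, 'C': 6, 'D': 7}])   # the occasional extra key

df = pd.concat([first, second], axis=0, sort=True, ignore_index=True)
# the first row now has NaN in column D; fill it with a placeholder if desired
df = df.fillna('NULL')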
I have a dictionary of dataframes
list_of_dfs={'df1:Dataframe','df2:Dataframe','df3:Dataframe','df4:Dataframe'}
Each data frame contains the same variables (price, volume, "Sale/Purchase") that I want to manipulate to end up with a new subset of DataFrames. My new dataframes have to filter the variable called "Sale/Purchase", keeping only the observations that have "Sell" in that variable.
sell=df[df["Sale/Purchase"]=="Sell"]
My question is how do I loop over the dictionary in order to get a new dictionary with this new subset?
I don't know how to write this command to do the loop. I know it has to start like this:
# Create an empty dictionary called new_dfs to hold the results
new_dfs = {}
# Loop over key-value pair
for key, df in list_of_dfs.items():
But then, due to my limited knowledge of looping over a dictionary of dataframes, I don't know how to write the filter command. I would be really thankful if someone could help me.
Thanks in advance.
Try this,
dict_of_dfs={'df1':'Dataframe','df2':'Dataframe','df3':'Dataframe','df4':'Dataframe'}
# Create an empty dictionary called new_dfs to hold the results
new_dfs = {}
# Loop over key-value pair
for key, df in dict_of_dfs.items():
    new_dfs[key] = df[df["Sale/Purchase"]=="Sell"]
Explanation:
new_dfs = {} # Here we have created an empty dictionary.
# A dictionary contains keys and values.
# To add keys and values to our dictionary,
# we need to do it as shown below:
new_dfs[our_key_1] = our_value_1
new_dfs[our_key_2] = our_value_2
.
.
.
You can map a function:
lambda df: df[df["Sale/Purchase"] == "Sell"]
HOW:
Syntax = map(fun, iter)
map(lambda df: df[df["Sale/Purchase"] == "Sell"], list_of_dfs)
You can map it over a list or a set.
For dict:
df_dict = {k: df[df["Sale/Purchase"]=="Sell"] for k, df in list_of_dfs.items()}
Something like:
sells = {k: v[v["Sale/Purchase"] == "Sell"] for (k, v) in list_of_dfs.items()}
This pattern is called a dictionary comprehension. According to this question, it is the fastest and most Pythonic approach.
You should provide an example of the data you are dealing with for a more precise answer.
There are questions similar to this, but none of them handle the case where my dataframe is inside an HDFStore.
I need to turn a list of timestamp/key/value items into dataframes, store them as several dataframes each indexed on the timestamp, and then save them in an HDFStore.
Example code:
from pandas import HDFStore
from pandas import DataFrame
store = HDFStore('xxx', driver="H5FD_CORE")
for i, k, v in ((0, 'x', 5), (1, 'y', 6)):
    if k not in store:
        store[k] = DataFrame()
    store[k].set_value(i, 'value', v)
After this code runs, store['x'] remains empty.
>>> store['x']
Empty DataFrame
Columns: []
Index: []
So there is obviously some reason why that is not persisting, and it is also certainly the case that I just don't know how this stuff is supposed to work. I can certainly figure out the logic if I just understand how you append to tables/dataframes inside an HDFStore.
I could also just keep the dataframes in memory, in some kind of dictionary, and assign them to the HDFStore right at the end. I somehow had this misguided idea that doing it this way would save memory, but perhaps I am wrong about that too.
I'd comment to get some clarification, but I don't have the rep yet. Without some more context, it's hard for me to say whether your approach is wise, but I'd be inclined to say no in almost all cases. Correct me if I'm wrong, but what you're trying to do is:
Given a list of iterables: [(timeA, key1, value1), (timeB, key1, value2), (timeC, key2, value1)]
You would want two df's in the HDFStore, where:
store[key1] = DataFrame([value1, value2], index=[timeA, timeB])
store[key2] = DataFrame([value1], index=[timeC])
Correct?
If so, what I would recommend is some kind of "filtering" on your store key, creating dataframes, and then writing a whole dataframe to the store, like so:
dataTuples = [(0, 'x', 5), (1, 'y', 6), ...]

# initializing the dict of lists, which will become a dict of df's
sortedByStoreKey = {storeKey: [] for idx, storeKey, val in dataTuples}
for idx, storeKey, val in dataTuples:
    sortedByStoreKey[storeKey].append([idx, val])  # appending a 2-list to a list

# this can all be done with dict comprehensions but this is more legible imo
for storeKey, dfContents in sortedByStoreKey.items():
    df = pd.DataFrame(dfContents, columns=['time', 'value'])
    df['time'] = pd.to_datetime(df['time'])  # make sure this is read as a pd.DatetimeIndex (as you said you wanted)
    df.set_index('time', inplace=True)
    sortedByStoreKey[storeKey] = df

# now we write full dataframes to HDFStore
with pd.HDFStore('xxx') as store:
    for storeKey, df in sortedByStoreKey.items():
        store[storeKey] = df
I'm quite confident there's a more efficient way to do this, both number-of-lines-wise and resources-wise, but this is what strikes me as the most pythonic. If the dataTuples object is HUGE (like >= RAM), then my answer may have to change.
Generally speaking, the idea here is to create each of the dataframes in full before writing to the store. As I'm finishing up here, I'm realizing that you can do what you've chosen as well, and the piece that you are missing is the need to specify the store with a table format, which enables appending. Granted, appending one row at a time is probably not a good idea.
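If you did want to take the incremental route anyway, here is a rough sketch of that idea (not code from the question, and it assumes PyTables is installed): storing with format='table' makes the object appendable, so later chunks with the same columns can be added with store.append.
import pandas as pd

with pd.HDFStore('xxx') as store:
    chunk = pd.DataFrame({'value': [5]}, index=pd.to_datetime([0]))
    chunk.index.name = 'time'
    # format='table' makes the stored frame appendable later on
    store.append('x', chunk, format='table')

    more = pd.DataFrame({'value': [6]}, index=pd.to_datetime([1]))
    more.index.name = 'time'
    store.append('x', more, format='table')   # same columns, so this just adds rows
Appending in larger chunks rather than one row at a time keeps the overhead reasonable.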