Removing nested brackets from a pandas dataframe?

So I am trying to convert a .mat file into a dataframe in order to run some data analysis. After converting it, I have a dataframe structure (see 1), but I have no idea how to remove the brackets from the objects in the dataframe. I have tried utilizing:
mdataframe['0'] = mdataframe['0'].str[0]
and
mdataframe['0'] = mdataframe['0'].str.get(0)
in an attempt to fix the 0th column, but to no avail. Any help and guidance would be appreciated.
Thank you!

Thank you for your question. It is indeed a very interesting subject.
I have never seen a problem quite like yours; nevertheless, your DataFrame conversion problem is quite straightforward to solve.
You need two steps:
1. Write a squashing function that will be applied to each entry in your table (i.e., DataFrame). This function must act as a dimensionality reducer. Since we don't know how many levels of nesting to expect in each cell of your table, the function has to be capable of calling itself (a recursive function).
2. Apply the squashing function to each entry of your table, and return the converted table.
Therefore, by following steps 1 and 2, I have created a code snippet that generates a DataFrame similar to your example and squashes its cells accordingly.
Code Snippet
import numpy as np
import pandas as pd
from typing import Any

def generateDataFrameWithNestedListsInItsCells() -> pd.DataFrame:
    df = pd.DataFrame.from_records([[[["a"]]], [["b"]], [[["c"]]], [["b"]]])
    return df

def squashList(element: Any) -> Any:
    # Flatten the cell contents into a 1-D object array.
    array = np.ravel(np.asanyarray(element, dtype=object))
    first = array[0]
    # If the first entry is still a nested container, squash it recursively.
    if isinstance(first, (list, np.ndarray)):
        return squashList(first)
    return first

if __name__ == "__main__":
    df = generateDataFrameWithNestedListsInItsCells()
    df2 = df.applymap(squashList)
Notice that the df instance has your nested lists, while its converted form (i.e., df2) has the correct entries.
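Applied to the frame from your question, the same idea would look like the following (a sketch, since only you have the mdataframe object; the column label '0' is taken from your attempts above):
mdataframe['0'] = mdataframe['0'].apply(squashList)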
I hope that this example helps you in your research.


New dataframe in Pandas based on specific values (a lot of them) from existing df

Good evening! I'm using pandas in a Jupyter Notebook. I have a huge dataframe representing the full history of posts of 26 channels in a messenger. It has a column "dialog_id" which represents in which dialog the message was sent (so there can be only 26 unique values in the column, but there are more than 700k rows, and the df is sorted by time, not id, so it is kinda chaotic). I have to split this dataframe into 2 different ones (one will contain the full history of 13 channels, and the other the history of the remaining 13 channels). I know the ids by which I have to split, and they are random as well. For example, one is -1001232032465 and the other is -1001153765346.
The question is, how do I do this most elegantly?
I know I can do it somehow with df.loc[], but I don't want to write 13 rows of df.loc[]. I've tried to use logical operators for this, like:
df1.loc[(df["dialog_id"] == '-1001708255880') & (df["dialog_id"] == '-1001645788710')]
but it doesn't work. I suppose I'm using them wrong. I expect a solution with any method creating a new df, with the use of logical operators. In verbal expression, I think it should sound like "put the row in a new df if the dialog_id is x, or dialog_id is y, or dialog_id is z, etc.". Please help me!
The easiest way seems to be just setting up a query.
import pandas as pd

df = pd.DataFrame(dict(col_id=[1, 2, 3, 4], other=[5, 6, 7, 8]))
channel_groupA = [1, 2]
channel_groupB = [3, 4]
# query compares a column against a list as a membership test
df_groupA = df.query(f'col_id == {channel_groupA}')
df_groupB = df.query(f'col_id == {channel_groupB}')
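An equivalent idiom uses a boolean mask with isin. Here is a sketch using one of the asker's example ids (this assumes dialog_id holds integers; quote the values if the column stores text):
group_ids = [-1001232032465]                 # the 13 ids of the first group go here
df1 = df[df['dialog_id'].isin(group_ids)]    # history of the first 13 channels
df2 = df[~df['dialog_id'].isin(group_ids)]   # the complementary 13 channels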

How to use Python to fill specific data into a column in Excel based on information in the first column?

I have a problem with an Excel file, and I want to automate it using a Python script to complete a column based on the information in the first column. For example:
if the value is 'G711Alaw 64k' or 'G711Ulaw 64k', fill in '1-Jan' until the next date '2-Jan' is found, then fill in '2-Jan', and so on.
Before automating: (screenshot)
I need it to look like this after automating: (screenshot)
Is there anyone who can help me solve this issue?
The file: (link to the Excel file)
Thanks a lot for your help.
Try this. pandas reads your Jan-1 values as the datetime type; if you need to change them to strings you can do that directly in the code. The following code assigns the date read from the first column directly to the second column:
import pandas as pd

df = pd.read_excel("add_date_column.xlsx", engine="openpyxl")

sig = []  # holds the most recent date seen while walking down the rows
def t(x):
    global sig
    if not isinstance(x.values[0], str):
        # The first column holds a date here; remember it.
        tmp_sig = x.values[0]
        if tmp_sig not in sig:
            sig = [tmp_sig]
    # Write the most recent date into the second column.
    x.values[1] = sig[-1]
    return x

new_df = df.apply(t, axis=1)
new_df.to_excel("new.xlsx", index=False)
The concept is very simple:
If the value is a date/time, copy it to [same row, next column].
If not, [same row, next column] is copied from [previous row, next column].
You do not specifically need Python for this task. The Excel formula for this would be:
=IF(ISNUMBER(A:A),A:A,B1)
Instead of checking whether the value is a date/time, I took advantage of the fact that the rest of the entries are alphanumeric (containing both letters and numbers). This formula is applied in the new column.
Of course, you might already be in Python and just work within the same environment. So, here's the loop :
from datetime import datetime

# Assumes the first row holds a date, as in the screenshots.
for i in range(len(df)):
    if isinstance(df["Orig. Codec"][i], datetime):
        df.loc[i, "Column1"] = df["Orig. Codec"][i]
    else:
        df.loc[i, "Column1"] = df.loc[i - 1, "Column1"]
There might be a way to use a lambda function for the same concept, but I am not aware of how to apply a lambda and a shift at the same time.
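A vectorized alternative is also possible: coerce the first column to datetimes (the codec strings become NaT) and forward-fill. This is a sketch, assuming the same file and column names as above:
import pandas as pd

df = pd.read_excel("add_date_column.xlsx", engine="openpyxl")
# Strings fail to parse and become NaT; real dates survive.
dates = pd.to_datetime(df["Orig. Codec"], errors="coerce")
# Propagate the last seen date down to every codec row.
df["Column1"] = dates.ffill()
df.to_excel("new.xlsx", index=False)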

How to replace empty values with reference to another dataframe?

I have 2 data frames. One is a reference table with columns: code and name. The other one holds lists of dictionaries. The second data frame has code filled in, but some names are empty strings. I am thinking of performing 2 for loops to get to the dictionaries. But I am new to this, so I am unsure how to get the value from the reference table.
Started with something like this:
for i in sample:
    for j in i:
        if j['name'] == '':
            (j['code'])
I am unsure how to proceed with the code. I think there is a very simple way with .map() function. Can someone help?
Reference table: (screenshot)
Table after the needed edit: (screenshot)
It seems to me that in this particular case you're using Pandas only to work with Python data structures. If that's the case, it would make sense to ditch Pandas altogether and just use Python data structures - usually, it results in more idiomatic and readable code that often performs better than Pandas with dtype=object.
In any case, here's the code:
import pandas as pd

sample_name = pd.DataFrame(dict(code=[8, 1, 6],
                                name=['Human development',
                                      'Economic management',
                                      'Social protection and risk management']))
# We just need a Series.
sample_name = sample_name.set_index('code')['name']

sample = pd.Series([[dict(code=8, name='')],
                    [dict(code=1, name='')],
                    [dict(code=6, name='')]])

def fix_dict(d):
    if not d['name']:
        d['name'] = sample_name.at[d['code']]
    return d

def fix_dicts(dicts):
    return [fix_dict(d) for d in dicts]

result = sample.map(fix_dicts)
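For comparison, the pure-Python version suggested above might look like this (a sketch over the same toy data):
code_to_name = {8: 'Human development',
                1: 'Economic management',
                6: 'Social protection and risk management'}

sample = [[dict(code=8, name='')],
          [dict(code=1, name='')],
          [dict(code=6, name='')]]

for row in sample:
    for d in row:
        if not d['name']:
            d['name'] = code_to_name[d['code']]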

Returning a dataframe in Dask

Aim: to speed up applying a function row-wise across a large data frame (~1.9 million rows).
Attempt: using Dask map_partitions where partitions == number of cores. I've written a function which is applied to each row; it creates a dict containing a variable number of new values (between 1 and 55). This function works fine standalone.
Problem: I need a way to combine the output of each function into a final dataframe. I tried using df.append, where I'd append each dict to a new dataframe and return this dataframe. If I understand the Dask docs, Dask should then combine them into one big DF. Unfortunately this line is tripping an error (ValueError: could not broadcast input array from shape (56) into shape (1)), which leads me to believe it's something to do with the combine feature in Dask?
# Function to be applied row-wise down the dataframe. Takes a column (post) and a new empty df.
def func(post, New_DF):
    post = str(post)
    scores = OtherFUNC.countWords(post)
    scores['post'] = post
    New_DF = New_DF.append(scores, ignore_index=True)
    return New_DF

# Dask
dd.from_pandas(dataset, npartitions=nCores).\
    map_partitions(
        lambda df: df.apply(
            lambda x: func(x.post, New_DF), axis=1)).\
    compute(get=get)
I am not quite sure I completely understand your code in the absence of an MCVE, but I think there is a bit of a misunderstanding here.
In this piece of code you take a row and a DataFrame and append one row to that DataFrame.
# Function to be applied row-wise down the dataframe. Takes a column (post) and a new empty df.
def func(post, New_DF):
    post = str(post)
    scores = OtherFUNC.countWords(post)
    scores['post'] = post
    New_DF = New_DF.append(scores, ignore_index=True)
    return New_DF
Instead of appending to New_DF, I would recommend just returning a pd.Series which df.apply concatenates into a DataFrame. That is because if you are appending to the same New_DF object in all nCores partitions, you are bound to run into trouble.
# Function to be applied row-wise down the dataframe. Takes a row and returns a row.
def tobsecret_func(row):
    post = str(row.post)
    scores = OtherFUNC.countWords(post)
    scores['post'] = post
    length_adjusted_series = pd.Series(scores).reindex(range(55))
    return length_adjusted_series
Your error also suggests that as you wrote in your question, your function creates a variable number of values. If the pd.Series you return doesn't have the same shape and column names, then df.apply will fail to concatenate them into a pd.DataFrame. Therefore make sure you return a pd.Series of equal shape each time. This question shows you how to create pd.Series of equal length and index: Pandas: pad series on top or bottom
I don't know what kind of dict your OtherFUNC.countWords returns exactly, so you may want to adjust the line:
length_adjusted_series = pd.Series(scores).reindex(range(55))
As is, the line would return a Series with an index 0, 1, 2, ..., 54 and up to 55 values (if the dict originally had fewer than 55 keys, the remaining cells will contain NaN values).
This means after applied to a DataFrame, the columns of that DataFrame would be named 0, 1, 2, ..., 54.
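For instance, this is how reindex pads a short dict to the fixed length (a minimal sketch):
import pandas as pd

scores = {0: 3, 1: 5}                          # only two values this time
padded = pd.Series(scores).reindex(range(55))  # indices 2..54 become NaN
print(padded.shape)                            # (55,)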
Now you take your dataset and map your function to each partition and in each partition you apply it to the DataFrame using apply.
# Dask
dd.from_pandas(dataset, npartitions=nCores).\
    map_partitions(
        lambda df: df.apply(
            lambda x: func(x.post, New_DF), axis=1)).\
    compute(get=get)
map_partitions expects a function which takes a DataFrame as input and outputs a DataFrame. Your code does this by using a lambda function that basically calls your other function and applies it to a DataFrame, which in turn returns a DataFrame. This works, but I highly recommend writing a named function which takes a DataFrame as input and outputs a DataFrame; it makes it easier for you to debug your code.
For example with a simple wrapper function like this:
def df_wise(df):
    return df.apply(tobsecret_func, axis=1)
Especially as your code gets more complex, avoiding lambda functions that call non-trivial code like your custom func, and instead writing a simple named function, can help you debug: the traceback will not just lead you to a line with a bunch of lambda functions like in your code, but will point directly to the named function df_wise, so you will see exactly where the error is coming from.
# Dask
dd.from_pandas(dataset, npartitions=nCores).\
    map_partitions(df_wise,
                   meta=df_wise(dataset.head())
                   ).\
    compute(get=get)
Notice that we just fed dataset.head() to df_wise to create our meta keyword, which is similar to what Dask would do under the hood.
You are using dask.get, the synchronous scheduler, which is why the whole New_DF.append(...) code could work, since you append to the DataFrame for each consecutive partition.
This does not give you any parallelism and thus will not work if you use one of the other schedulers, all of which parallelise your code.
The documentation also mentions the meta keyword argument, which you should supply to your map_partitions call, so dask knows what columns your DataFrame will have. If you don't do this, dask will first have to do a trial run of your function on one of the partitions and check what the shape of the output is before it can go ahead and do the other partitions. This can slow down your code by a ton if your partitions are large; giving the meta keyword bypasses this unnecessary computation for dask.
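Putting the pieces together, here is a self-contained sketch of the whole pattern. Since only the asker has OtherFUNC.countWords, count_words below is a stand-in toy scorer, and the modern compute() call is used instead of the deprecated get= keyword:
import pandas as pd
import dask.dataframe as dd

def count_words(post):
    # Stand-in for OtherFUNC.countWords: one value per word.
    return {i: len(word) for i, word in enumerate(post.split())}

def tobsecret_func(row):
    post = str(row.post)
    scores = count_words(post)
    scores['post'] = post
    # Pad to a fixed index so every row comes back with the same shape.
    return pd.Series(scores).reindex(list(range(55)) + ['post'])

def df_wise(df):
    return df.apply(tobsecret_func, axis=1)

dataset = pd.DataFrame({'post': ['hello world', 'a b c', 'dask runs this in parallel']})
result = (dd.from_pandas(dataset, npartitions=2)
            .map_partitions(df_wise, meta=df_wise(dataset.head()))
            .compute())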

What is a proper idiom in pandas for creating a dataframes from the output of a apply function on a df?

Edit --- I've made some progress and discovered the drop_duplicates method in pandas, which saves some custom duplicate-removal functions I had created.
This changes the question in a couple of ways, because it changes my initial requirements.
One of the operations I need to conduct is grabbing the latest feed entries --- the feed urls exist in a column in a data frame. Once I've done the apply I get feed objects back:
import pandas as pd
import feedparser

df_check_feeds = pd.DataFrame(
    {'account_name': ['NYTimes', 'WashPo'],
     'feed_url': ['http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml',
                  'http://feeds.washingtonpost.com/rss/homepage'],
     'last_update': ['2015-12-28 23:50:40', '2015-12-28 23:50:40']})
df_check_feeds["feeds_results"] = df_check_feeds.feed_url.apply(feedparser.parse)
df_check_feeds["entries"] = df_check_feeds.feeds_results.apply(lambda x: x.entries)
So, now I'm stuck with the feed entries in the "entries" column. I'd like to create two new data frames in one apply method and concatenate the two frames immediately.
I've expressed the equivalent in a for loop:
frames_list = []
for index in df_check_feeds.index:
    df_temp = pd.DataFrame(df_check_feeds.entries[index])
    df_temp['account_name'] = df_check_feeds.loc[index, 'account_name']
    # some error checking on the info here
    frames_list.append(df_temp)
df_total_results = pd.concat(frames_list)
df_total_results
I realize I could do this in a for loop (and indeed have written that), but I feel there is a better, more succinct, pandas-idiomatic way of writing this statement.
A more compact way could be:
df_total_results = df_check_feeds.groupby('account_name').apply(lambda x: pd.DataFrame(x['entries'].iloc[0]))
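One caveat: because each group returns a DataFrame, the result carries a MultiIndex with account_name as its outer level. A sketch to flatten it back into a regular column:
df_total_results = df_total_results.reset_index(level='account_name')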
