I have a very large list, so I will use the below as a reproducible example. I would like to unlist the following so I can use the keys of the dictionaries as columns of a dataframe.
[{'message':'Today is a sunny day.','comments_count':'45','id':
'1401305690071546_11252160039985938','created_time': '2020-02-29T13:43:46+0000'},
{'message':'Today is a cloudy day.','comments_count':'47','id':
'1401305690073586_11252160039985938','created_time': '2020-03-29T13:43:46+0000'}]
The desired output is the following columns as a pandas dataframe:
message comments_count id created_time
If it's a list of dictionaries that you want to transform into a dataframe, you can just do the following:
df1 = pd.DataFrame(l)
# or
df2 = pd.DataFrame.from_dict(l)
The output in both cases is:
print(df2)
print(df2.columns)
message ... created_time
0 Today is a sunny day. ... 2020-02-29T13:43:46+0000
1 Today is a cloudy day. ... 2020-03-29T13:43:46+0000
[2 rows x 4 columns]
Index(['message', 'comments_count', 'id', 'created_time'], dtype='object')
If you want to put all of the data into the dataframe:
import pandas as pd
my_container = [{'message':'Today is a sunny day.','comments_count':'45','id': '1401305690071546_11252160039985938','created_time': '2020-02-29T13:43:46+0000'}, {'message':'Today is a cloudy day.','comments_count':'47','id': '1401305690073586_11252160039985938','created_time': '2020-03-29T13:43:46+0000'}]
df = pd.DataFrame(my_container)
If you want an empty dataframe with the correct columns:
columns = set()
for d in my_container:
    columns.update(d.keys())
# Sets are unordered, so convert to a list before building the frame.
df = pd.DataFrame(columns=list(columns))
You can iterate through the list and check each item's type. Note that removing items from a list while iterating over it skips elements, so iterate over a copy:
dictList = []
for i in list(myList):  # iterate over a copy so removal is safe
    if isinstance(i, dict):
        dictList.append(i)
        myList.remove(i)
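A safer variant of the same idea (a sketch with made-up data): partitioning with two list comprehensions avoids mutating the list while iterating over it entirely.

```python
myList = [{'a': 1}, 'not a dict', {'b': 2}, 42]

# Partition into dictionaries and everything else.
dictList = [i for i in myList if isinstance(i, dict)]
others = [i for i in myList if not isinstance(i, dict)]

print(dictList)  # [{'a': 1}, {'b': 2}]
print(others)    # ['not a dict', 42]
```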
I have data formatted as dict[tuple[str, str], list[float]] and I want to convert it into a pandas dataframe.
Example data:
{('A','B'): [-0.008035100996494293,0.008541940711438656]}
I tried some data manipulation using split functions.
Expected result:
import pandas as pd
data = {('A','B'): [-0.008035100996494293,0.008541940711438656], ('C','D'): [-0.008035100996494293,0.008541940711438656]}
title = []
heading = []
num_col1 = []
num_col2 = []
for key, val in data.items():
    title.append(key[0])
    heading.append(key[1])
    num_col1.append(val[0])
    num_col2.append(val[1])
data_ = {'title': title, 'heading': heading, 'num_col1': num_col1, 'num_col2': num_col2}
pd.DataFrame(data_)
Your best bet will be to construct your Index manually. For this we can use pandas.MultiIndex.from_tuples since your dictionary keys are stored as tuples. From there we just need to store the values of the dictionary into the body of a DataFrame.
import pandas as pd
data = {('A','B'): [-0.008035100996494293,0.008541940711438656]}
index = pd.MultiIndex.from_tuples(data.keys(), names=['title', 'heading'])
df = pd.DataFrame(data.values(), index=index).reset_index()
print(df)
title heading 0 1
0 A B -0.008035 0.008542
If you want a chained operation, you can do:
import pandas as pd
data = {('A','B'): [-0.008035100996494293,0.008541940711438656]}
df = (
pd.DataFrame.from_dict(data, orient='index')
.pipe(lambda d:
d.set_axis(pd.MultiIndex.from_tuples(d.index, names=['title', 'heading']))
)
.reset_index()
)
print(df)
title heading 0 1
0 A B -0.008035 0.008542
Another possible solution, which also works when the tuples and lists vary in length:
pd.concat([pd.DataFrame.from_records([x for x in d.keys()],
columns=['title', 'h1', 'h2']),
pd.DataFrame.from_records([x[1] for x in d.items()])], axis=1)
Output:
title h1 h2 0 1 2
0 A B None -0.008035 0.008542 NaN
1 C B D -0.010351 1.008542 5.0
Data input:
d = {('A','B'): [-0.008035100996494293,0.008541940711438656],
('C','B', 'D'): [-0.01035100996494293,1.008541940711438656, 5]}
You can expand the keys and values as you iterate over the dictionary items. Pandas will see four values, which it will make into a row.
>>> import pandas as pd
>>> data = {('A','B'): [-0.008035100996494293,0.008541940711438656]}
>>> pd.DataFrame(((*d[0], *d[1]) for d in data.items()), columns=("Title", "Heading", "Foo", "Bar"))
Title Heading Foo Bar
0 A B -0.008035 0.008542
I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'code': ['52511', '52512', '12525', '13333']})
and the following list:
list = ['525', '13333']
I want to consider only the observations of df that start with the elements of the list.
Desired output:
import pandas as pd
df = pd.DataFrame({'code': ['52511', '52512', '13333']})
The str.startswith method supports a tuple of prefixes, so you can convert the list to a tuple:
listt = ['525', '13333']
df = df[df['code'].str.startswith(tuple(listt))]
df
'''
code
0 52511
1 52512
3 13333
'''
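If you later need more than plain prefixes, `str.match` with a regex alternation is an equivalent approach (a sketch; `str.match` anchors at the start of each string, and `re.escape` guards against metacharacters in the prefixes):

```python
import re

import pandas as pd

df = pd.DataFrame({'code': ['52511', '52512', '12525', '13333']})
prefixes = ['525', '13333']

# Build an alternation pattern; str.match tests it at the start of each value.
pattern = '|'.join(map(re.escape, prefixes))
out = df[df['code'].str.match(pattern)]
print(out)
```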
I have a drug database saved in a SINGLE column of a CSV file that I can read with Pandas. The file contains 750000 rows and its elements are divided by "///". The column also ends with "///", and every row seems to end with ";".
I would like to split it into multiple columns in order to create a structured database. Capitalized words (drug information) like "ENTRY", "NAME" etc. will be the headers of these new columns.
So it has some structure, although the elements can be described by a different number and sort of information, meaning some elements will just have NaN in some cells. I have never worked with such an SQL-like format, and it is difficult to reproduce in Pandas code, too. Please see the screenshots for more information.
An example of desired output would look like this:
import numpy as np

df = pd.DataFrame({
    "ENTRY": ["001", "002", "003"],
    "NAME": ["water", "ibuprofen", "paralen"],
    "FORMULA": ["H2O", "C5H16O85", "C14H24O8"],
    "COMPONENT": [np.nan, np.nan, "paracetamol"]})
I am guessing there will be .split() involved, based on the CAPITALIZED words? A Python 3 solution would be appreciated. It could help a lot of people. Thanks!
I helped as much as I could:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We create an additional dataframe.
dfi = pd.DataFrame()
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
dfi['Key1'] = dfi['Key'] = df[(df['Key'] == 'ENTRY')].index
dfi = dfi.set_index('Key1')
df = df.join(dfi, lsuffix='_caller', rsuffix='_other')
df.ffill(inplace=True)
df = df.astype({"Key_other": "Int64"})
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key_caller', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
Small code refactoring:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'C:\Users\ф\drug\drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
df['Key_other'] = None
df.loc[(df['Key'] == 'ENTRY'), 'Key_other'] = df[(df['Key'] == 'ENTRY')].index
df['Key_other'] = df['Key_other'].ffill()
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df['NAME'] = df['NAME'].str.split(r'\(', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
print(df)
Key ENTRY NAME FORMULA \
0 D00001 Water H2O
1 D00002 Nadide C21H28N7O14P2
2 D00003 Oxygen O2
3 D00004 Carbon dioxide CO2
4 D00005 Flavin adenine dinucleotide C27H33N9O15P2
... ... ... ...
11983 D12452 Fostroxacitabine bralpamide hydrochloride C22H30BrN4O8P. HCl
11984 D12453 Guretolimod C24H34F3N5O4
11985 D12454 Icenticaftor C12H13F6N3O3
11986 D12455 Lirafugratinib C28H24FN7O2
11987 D12456 Lirafugratinib hydrochloride C28H24FN7O2. HCl
Key COMPONENT
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
11983 NaN
11984 NaN
11985 NaN
11986 NaN
11987 NaN
[11988 rows x 4 columns]
It still needs a bit more polishing, but I leave that to you.
I have a list:
l = [{'County': 'SentenceCase'}, {'Postcode': 'UpperCase'}]
type(l) equals <class 'list'>
If I load it into a dataframe:
df = pd.DataFrame(l)
The first row then ends up as the column names.
County Postcode
0 SentenceCase NaN
1 NaN UpperCase
I've tried header=None, etc., but nothing seems to work.
I would want the dataframe to be
Header1 Header2
0 County SentenceCase
1 Postcode UpperCase
I think you are just using the wrong structure. Instead of a list of dictionaries, you should have a list of lists to load into the DataFrame:
[['County', 'SentenceCase'], ['Postcode', 'UpperCase']]
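For instance (a sketch), that structure goes straight into the constructor, with the desired headers passed via `columns=`:

```python
import pandas as pd

rows = [['County', 'SentenceCase'], ['Postcode', 'UpperCase']]
df = pd.DataFrame(rows, columns=['Header1', 'Header2'])
print(df)  # two rows, columns Header1 and Header2
```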
If for some reason you require your original structure, you could use something like this:
new_l = []
for item in l:
    for key, value in item.items():
        new_l.append([key, value])
new_df = pd.DataFrame(new_l)
new_df.columns = ['Header1', 'Header2']
new_df
which will give:
Header1 Header2
0 County SentenceCase
1 Postcode UpperCase
Every dictionary in the list represents a row in the table, and each key is a header.
It should look like this:
l = [{'Header1': 'County','Header2':'SentenceCase'},
{'Header1': 'Postcode ','Header2':'UpperCase'}]
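With the keys restructured that way, `pd.DataFrame` produces the desired table directly (a sketch):

```python
import pandas as pd

l = [{'Header1': 'County', 'Header2': 'SentenceCase'},
     {'Header1': 'Postcode', 'Header2': 'UpperCase'}]
df = pd.DataFrame(l)
print(df.columns.tolist())  # ['Header1', 'Header2']
```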
The question seems simple and arguably on the verge of stupid. But given my scenario, it seems that I would have to do exactly that in order to keep a bunch of calculations across several dataframes efficient.
Scenario:
I've got a bunch of pandas dataframes where the column names are constructed from a name part and a time part, such as 'AA_2018' and 'BB_2017'. And I'm doing calculations on different columns from different dataframes, so I'll have to filter out the time part. As an MCVE, let's just say that I'd like to subtract the column containing 'AA' from the column containing 'BB' and ignore all other columns in this dataframe:
import pandas as pd
import numpy as np
dates = pd.date_range('20180101',periods=3)
df = pd.DataFrame(np.random.randn(3,3),index=dates,columns=['AA_2018', 'AB_2018', 'BB_2017'])
If i knew the exact name of the columns, this can easily be done using:
diff_series = df['AA_2018'] - df['BB_2017']
This returns a pandas series since I'm using single brackets [], as opposed to a dataframe if I had used double brackets [[]].
My challenge:
diff_series is of type pandas.core.series.Series. But since I've got some filtering to do, I'm using df.filter() that returns a dataframe with one column and not a series:
# in:
colAA = df.filter(like = 'AA')
# out:
# AA_2018
# 2018-01-01 0.801295
# 2018-01-02 0.860808
# 2018-01-03 -0.728886
# in:
# type(colAA)
# out:
# pandas.core.frame.DataFrame
Since colAA is of type pandas.core.frame.DataFrame, the following returns a dataframe too:
# in:
colAA = df.filter(like = 'AA')
colBB = df.filter(like = 'BB')
df_filtered = colBB - colAA
# out:
AA_2018 BB_2017
2018-01-01 NaN NaN
2018-01-02 NaN NaN
2018-01-03 NaN NaN
And that is not what I'm after. This is:
# in:
diff_series = df['AA_2018'] - df['BB_2017']
# out:
2018-01-01 0.828895
2018-01-02 -1.153436
2018-01-03 -1.159985
Why am I adamant in doing it this way?
Because I'd like to end up with a dataframe using .to_frame() with a specified name based on the filters I've used.
My presumably inefficient approach is this:
# in:
colAA_values = [item for sublist in colAA.values for item in sublist]
# (because colAA.values returns a nested (2-D) array)
colBB_values = [item for sublist in colBB.values for item in sublist]
serAA = pd.Series(colAA_values, colAA.index)
serBB = pd.Series(colBB_values, colBB.index)
df_diff = (serBB - serAA).to_frame(name = 'someFilter')
# out:
someFilter
2018-01-01 -0.828895
2018-01-02 1.153436
2018-01-03 1.159985
What I've tried / What I was hoping to work:
# in:
(df.filter(like = 'BB') - df.filter(like = 'AA')).to_frame(name = 'somefilter')
# out:
# AttributeError: 'DataFrame' object has no attribute 'to_frame'
# (Of course because df.filter() returns a one-column dataframe)
I was also hoping that df.filter() could be set to return a pandas series, but no.
I guess I could have asked this question instead: How to convert a pandas dataframe column to a pandas series? But that does not seem to have an efficient built-in one-liner either; most search results handle the other direction. I've been messing around with potential work-arounds for quite some time now, and an obvious solution might be right around the corner, but I'm hoping some of you have a suggestion on how to do this efficiently.
All code elements for an easy copy&paste:
import pandas as pd
import numpy as np
dates = pd.date_range('20180101',periods=3)
df = pd.DataFrame(np.random.randn(3,3),index=dates,columns=['AA_2018', 'AB_2018', 'BB_2017'])
#diff_series = df[['AA_2018']] - df[['BB_2017']]
#type(diff_series)
colAA = df.filter(like = 'AA')
colBB = df.filter(like = 'BB')
df_filtered = colBB - colAA
#type(df_filtered)
#type(colAA)
#colAA.values
# colAA.values returns a nested (2-D) array that has to be flattened for use in pd.Series
colAA_values = [item for sublist in colAA.values for item in sublist]
colBB_values = [item for sublist in colBB.values for item in sublist]
serAA = pd.Series(colAA_values, colAA.index)
serBB = pd.Series(colBB_values, colBB.index)
df_diff = (serBB - serAA).to_frame(name = 'someFilter')
# Attempts:
# (df.filter(like = 'BB') - df.filter(like = 'AA')).to_frame(name = 'somefilter')
You need the opposite of to_frame: DataFrame.squeeze converts a one-column DataFrame to a Series:
colAA = df.filter(like = 'AA')
colBB = df.filter(like = 'BB')
df_filtered = colBB.squeeze() - colAA.squeeze()
print(df_filtered)
2018-01-01 -0.479247
2018-01-02 -3.801711
2018-01-03 1.567574
Freq: D, dtype: float64
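Positional selection with `.iloc[:, 0]` achieves the same thing as squeeze (a sketch on the question's random data):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180101', periods=3)
df = pd.DataFrame(np.random.randn(3, 3), index=dates,
                  columns=['AA_2018', 'AB_2018', 'BB_2017'])

# .iloc[:, 0] pulls the single column out of each filter result as a Series.
diff = df.filter(like='BB').iloc[:, 0] - df.filter(like='AA').iloc[:, 0]
df_diff = diff.to_frame(name='someFilter')
print(type(diff))  # <class 'pandas.core.series.Series'>
```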