Pandas parse json in column and expand to new rows in dataframe - python

I have a dataframe containing (record formatted) json strings as follows:
In [9]: pd.DataFrame({'col1': ['A', 'B'],
                      'col2': ['[{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"25.0"}]',
                               '[{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"15.0"}]']})
Out[9]:
  col1                                               col2
0    A  [{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"2...
1    B  [{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"1...
I would like to extract the json and for each record add a new row to the dataframe:
  col1         t   v
0    A  05:15:00  20
1    A  05:20:00  25
2    B  05:15:00  10
3    B  05:20:00  15
I've been experimenting with the following code:
def json_to_df(x):
    df2 = pd.read_json(x.col2)
    return df2

df.apply(json_to_df, axis=1)
but the resulting dataframes are assigned as tuples, rather than creating new rows. Any advice?

The problem with apply is that you need to return multiple rows, while it expects only one. A possible solution:
from functools import reduce  # needed on Python 3

def json_to_df(row):
    _, row = row  # iterrows() yields (index, row) tuples
    df_json = pd.read_json(row.col2)
    col1 = pd.Series([row.col1] * len(df_json), name='col1')
    return pd.concat([col1, df_json], axis=1)

dfs = map(json_to_df, df.iterrows())        # an iterable of dataframes
df = reduce(lambda x, y: x.append(y), dfs)  # glues them together
df
  col1      t   v
0    A  05:15  20
1    A  05:20  25
0    B  05:15  10
1    B  05:20  15

Ok, taking a little inspiration from hellpanderrr's answer above, I came up with the following:
In [92]:
pd.DataFrame({'X': ['A', 'B'], 'Y': ['fdsfds', 'fdsfds'],
              'json': ['[{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"25.0"}]',
                       '[{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"15.0"}]']})
Out[92]:
   X       Y                                               json
0  A  fdsfds  [{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"2...
1  B  fdsfds  [{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"1...
In [93]:
dfs = []
def json_to_df(row, json_col):
    json_df = pd.read_json(row[json_col])
    dfs.append(json_df.assign(**row.drop(json_col)))

# `_` is IPython's handle on the previous output, i.e. the frame from Out[92]
_.apply(json_to_df, axis=1, json_col='json')
pd.concat(dfs)
Out[93]:
       t   v  X       Y
0  05:15  20  A  fdsfds
1  05:20  25  A  fdsfds
0  05:15  10  B  fdsfds
1  05:20  15  B  fdsfds
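For what it's worth, on newer pandas (0.25+, where DataFrame.explode and pd.json_normalize are available) the same expansion can be written without apply at all. A minimal sketch, assuming the frame from the question:

import json
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B'],
                   'col2': ['[{"t":"05:15","v":"20.0"}, {"t":"05:20","v":"25.0"}]',
                            '[{"t":"05:15","v":"10.0"}, {"t":"05:20","v":"15.0"}]']})

# Parse each JSON string into a list of dicts, explode to one dict per row,
# then expand the dicts into columns next to col1.
parsed = df.assign(col2=df['col2'].map(json.loads)).explode('col2')
out = pd.concat([parsed[['col1']].reset_index(drop=True),
                 pd.json_normalize(parsed['col2'].tolist())], axis=1)
print(out)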

Related

How do you generate a rolling count of the number of rows that are duplicated in Pandas? [duplicate]

I come from a SQL background and I use the following data-processing step frequently:
Partition the table of data by one or more fields.
For each partition, add a row number to each of its rows that ranks the row by one or more other fields, where the analyst specifies ascending or descending.
Example:
df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})
df
   data1  data2 key1
0      1      1    a
1      2     10    a
2      2      2    a
3      3      3    b
4      3     30    a
I'm looking for the pandas equivalent of this SQL window function:
RN = ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 ASC, Data2 DESC)

   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4
I've tried the following, which I've gotten to work where there are no 'partitions':
def row_number(frame, orderby_columns, orderby_direction, name):
    frame.sort_index(by=orderby_columns, ascending=orderby_direction, inplace=True)
    frame[name] = list(xrange(len(frame.index)))
I tried to extend this idea to work with partitions (groups in pandas), but the following didn't work:
df1 = df.groupby('key1').apply(lambda t: t.sort_index(by=['data1', 'data2'], ascending=[True, False], inplace=True)).reset_index()

def nf(x):
    x['rn'] = list(xrange(len(x.index)))

df1['rn1'] = df1.groupby('key1').apply(nf)
But I just got a lot of NaNs when I do this.
Ideally, there'd be a succinct way to replicate the window-function capability of SQL (I've figured out the window-based aggregates; that's a one-liner in pandas). Can someone share with me the most idiomatic way to number rows like this in pandas?
You can also use sort_values(), groupby() and finally cumcount() + 1:
df['RN'] = df.sort_values(['data1', 'data2'], ascending=[True, False]) \
             .groupby(['key1']) \
             .cumcount() + 1
print(df)
yields:
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4
P.S. Tested with pandas 0.18.
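A brief aside on why this keeps the original row order: cumcount returns a Series labelled with the original index, so the assignment aligns back to the unsorted frame. A quick check:

rn = (df.sort_values(['data1', 'data2'], ascending=[True, False])
        .groupby('key1')
        .cumcount() + 1)
print(rn.sort_index())  # 1, 2, 3, 1, 4 -- aligned with rows 0..4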
Use the groupby.rank function.
Here is a working example.
df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
  C1  C2
0  a   1
1  a   2
2  a   3
3  b   4
4  b   5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
  C1  C2  RANK
0  a   1     1
1  a   2     2
2  a   3     3
3  b   4     1
4  b   5     2
You can do this by using groupby twice along with the rank method:
In [11]: g = df.groupby('key1')
Use the min method argument to give values which share the same data1 the same RN:
In [12]: g['data1'].rank(method='min')
Out[12]:
0    1
1    2
2    2
3    1
4    4
dtype: float64
In [13]: df['RN'] = g['data1'].rank(method='min')
And then groupby these results and add the rank with respect to data2:
In [14]: g1 = df.groupby(['key1', 'RN'])
In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0    0
1    0
2    1
3    0
4    0
dtype: float64
In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1
In [17]: df
Out[17]:
   data1  data2 key1  RN
0      1      1    a   1
1      2     10    a   2
2      2      2    a   3
3      3      3    b   1
4      3     30    a   4
It feels like there ought to be a native way to do this (there may well be!...).
You can use transform and rank together. Here is an example:
df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'],
                   'C2': [1, 2, 3, 4, 5]})
df['Rank'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.rank())
df
Have a look at the pandas rank method for more information.
pandas.lib.fast_zip() can create a tuple array from a list of arrays. You can use this function to create a tuple Series, and then rank it:
values = {'key1': ['a', 'a', 'a', 'b', 'a', 'b'],
          'data1': [1, 2, 2, 3, 3, 3],
          'data2': [1, 10, 2, 3, 30, 20]}
df = pd.DataFrame(values, index=list("abcdef"))

def rank_multi_columns(df, cols, **kw):
    data = []
    for col in cols:
        # a leading "-" on a column name means rank that column descending
        if col.startswith("-"):
            flag = -1
            col = col[1:]
        else:
            flag = 1
        data.append(flag * df[col])
    values = pd.lib.fast_zip(data)
    s = pd.Series(values, index=df.index)
    return s.rank(**kw)

rank = df.groupby("key1").apply(lambda df: rank_multi_columns(df, ["data1", "-data2"]))
print(rank)
The result:
a    1
b    2
c    3
d    2
e    4
f    1
dtype: float64
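Pulling the cumcount idea together, here is a sketch of a small reusable helper in the spirit of SQL's ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...); the helper name and signature are my own, not a pandas API:

import pandas as pd

def row_number(df, partition_by, order_by, ascending):
    # Sort by the ORDER BY columns, number rows within each partition,
    # then align back to the frame's original row order via the index.
    return (df.sort_values(order_by, ascending=ascending)
              .groupby(partition_by)
              .cumcount()
              .add(1)
              .reindex(df.index))

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'a'],
                   'data1': [1, 2, 2, 3, 3],
                   'data2': [1, 10, 2, 3, 30]})
df['RN'] = row_number(df, 'key1', ['data1', 'data2'], [True, False])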

Pandas DataFrame efficiently split one column into multiple

I have a dataframe similar to this:
data = {"col_1": [0, 1, 2],
"col_2": ["abc", "defg", "hi"]}
df = pd.DataFrame(data)
Visually:
   col_1 col_2
0      0   abc
1      1  defg
2      2    hi
What I'd like to do is split up each character in col_2, and append it as a new column to the dataframe
Example iterative method:
def get_chars(string):
    chars = []
    for char in string:
        chars.append(char)
    return chars

char_df = pd.DataFrame()
for i in range(len(df)):
    char_arr = get_chars(df.loc[i, "col_2"])
    temp_df = pd.DataFrame(char_arr).T
    char_df = pd.concat([char_df, temp_df], ignore_index=True, axis=0)

df = pd.concat([df, char_df], ignore_index=True, axis=1)
Which results in the correct form:
   0     1  2  3    4    5
0  0   abc  a  b    c  NaN
1  1  defg  d  e    f    g
2  2    hi  h  i  NaN  NaN
But I believe iterating through the dataframe like this is very inefficient, so I want to find a faster (ideally vectorised) solution.
In reality I'm not really splitting up strings; the point of this question is to find a way to efficiently process one column and return many.
If you need performance, use the DataFrame constructor and convert the values to lists:
df = df.join(pd.DataFrame([list(x) for x in df['col_2']], index=df.index))
Or:
df = df.join(pd.DataFrame(df['col_2'].apply(list).tolist(), index=df.index))
print(df)
   col_1 col_2  0  1     2     3
0      0   abc  a  b     c  None
1      1  defg  d  e     f     g
2      2    hi  h  i  None  None
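More generally, when one column has to be processed into many, the same constructor pattern extends to any per-row function. A minimal sketch; the process helper here is hypothetical, standing in for whatever transform you actually need:

import pandas as pd

df = pd.DataFrame({"col_1": [0, 1, 2],
                   "col_2": ["abc", "defg", "hi"]})

def process(value):
    # Hypothetical per-row transform returning several values:
    # here, the string's length and its upper-cased form.
    return len(value), value.upper()

# Build all the new columns in one constructor call instead of
# concatenating row by row inside a loop.
new_cols = pd.DataFrame([process(x) for x in df["col_2"]],
                        index=df.index, columns=["length", "upper"])
df = df.join(new_cols)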

Split pandas data frame columns by row value

I have a data frame like this:
>df = pd.DataFrame({'A':['M',2,3],'B':['M',2,3],'AA':['N',20,30],'BB':['N',20,30]})
>df = df.rename(columns={df.columns[2]: 'A'})
>df = df.rename(columns={df.columns[3]: 'B'})
>df
A B A B
0 M M N N
1 2 2 20 20
2 3 3 30 30
and I have to split the data frame vertically by the values in row index 0 ('M' and 'N'):
   A  B
0  M  M
1  2  2
2  3  3

    A   B
0   N   N
1  20  20
2  30  30
The data in the data frame comes from an Excel sheet and the column names are not unique.
Thanks for the help!
This should get the job done:
df.loc[:,df.iloc[0, :] == "M"]
df.loc[:,df.iloc[0, :] == "N"]
Use pandas iloc for selecting columns:
import pandas as pd

df = pd.DataFrame({'A': ['M', 2, 3], 'B': ['M', 2, 3], 'AA': ['N', 20, 30], 'BB': ['N', 20, 30]})
df = df.rename(columns={df.columns[2]: 'A'})
df = df.rename(columns={df.columns[3]: 'B'})

df1 = df.iloc[:, :2]
df2 = df.iloc[:, 2:]
Output:
   A  B
0  M  M
1  2  2
2  3  3

    A   B
0   N   N
1  20  20
2  30  30
Use a list comprehension with loc:
dfs = [df.loc[:, df.loc[0, :].eq(s)] for s in ['M', 'N']]
This gives the separate dataframes in a list.
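Another option, assuming a pandas version where column-wise groupby is still available (groupby(..., axis=1) is deprecated in pandas 2.x): group the columns directly by their row-0 value, which also copes with the duplicate column names. A sketch:

import pandas as pd

df = pd.DataFrame({'A': ['M', 2, 3], 'B': ['M', 2, 3],
                   'AA': ['N', 20, 30], 'BB': ['N', 20, 30]})
df.columns = ['A', 'B', 'A', 'B']  # duplicate names, as in the question

# Group columns by the value in row 0 ('M' or 'N'); each group is one frame.
parts = {key: sub for key, sub in df.groupby(df.iloc[0].values, axis=1)}
df_m, df_n = parts['M'], parts['N']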

How do I hide the index column in pandas dataframe?

How am I supposed to remove the index column in the first row? I know it is not counted as a column, but when I transpose the data frame, it does not allow me to use my headers anymore.
In [297]: df = df.transpose()
print(df)
df = df.drop('RTM', 1)
df = df.drop('Requirements', 1)
df = df.drop('Test Summary Report', 1)
print(df)
This throws me the error "labels ['RTM'] not contained in axis".
RTM is contained in an axis, and the drops work if I do index_col=0:
df = xl.parse(sheet_name, header=1, index_col=0, usecols="A:E", nrows=6)
but then I lose my (0,0) value "Artifact name" as a header. Any help will be appreciated.
You can do this with .iloc by assigning the first row as the column names after transposing. Then drop that row and clean up the names:
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': list('ABCDE'),
                   'val1': np.arange(1, 6, 1),
                   'val2': np.arange(11, 16, 1)})

  id  val1  val2
0  A     1    11
1  B     2    12
2  C     3    13
3  D     4    14
4  E     5    15
Transpose and clean up the names:
df = df.T
df.columns = df.iloc[0]
df = df.drop(df.iloc[0].index.name)
df.columns.name = None
df is now:
      A   B   C   D   E
val1  1   2   3   4   5
val2  11  12  13  14  15
Alternatively, just create a new DataFrame to begin with, specifying which column you want to be the header column:
header_col = 'id'
cols = [x for x in df.columns if x != header_col]
pd.DataFrame(df[cols].values.T, columns=df[header_col], index=cols)
Output:
id     A   B   C   D   E
val1   1   2   3   4   5
val2  11  12  13  14  15
Using the setup from #ALollz:
df.set_index('id').rename_axis(None).T
      A   B   C   D   E
val1  1   2   3   4   5
val2  11  12  13  14  15
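As for the original Excel read: keeping index_col=0 does not actually lose "Artifact name"; it becomes the index name and can be recovered as a column. A sketch, assuming xl is a pd.ExcelFile as in the question:

df = xl.parse(sheet_name, header=1, index_col=0, usecols="A:E", nrows=6)
print(df.index.name)   # 'Artifact name' is kept here
df = df.reset_index()  # restores it as an ordinary column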

Extract first and last row of a dataframe in pandas

How can I extract the first and last rows of a given dataframe as a new dataframe in pandas?
I've tried to use iloc to select the desired rows and then concat as in:
df = pd.DataFrame({'a': range(1, 5), 'b': ['a', 'b', 'c', 'd']})
pd.concat([df.iloc[0, :], df.iloc[-1, :]])
but this does not produce a pandas dataframe:
a    1
b    a
a    4
b    d
dtype: object
I think the simplest way is .iloc[[0, -1]]:
df = pd.DataFrame({'a': range(1, 5), 'b': ['a', 'b', 'c', 'd']})
df2 = df.iloc[[0, -1]]
print(df2)
   a  b
0  1  a
3  4  d
You can also use head and tail:
In [29]: pd.concat([df.head(1), df.tail(1)])
Out[29]:
   a  b
0  1  a
3  4  d
The accepted answer duplicates the first row if the frame only contains a single row. If that's a concern,
df[0::len(df)-1 if len(df) > 1 else 1]
works even for single-row dataframes.
Example: for the following dataframe this will not create a duplicate:
df = pd.DataFrame({'a': [1], 'b': ['a']})
df2 = df[0::len(df)-1 if len(df) > 1 else 1]
print(df2)
   a  b
0  1  a
whereas this does:
df3 = df.iloc[[0, -1]]
print(df3)
   a  b
0  1  a
0  1  a
because the single row is the first AND last row at the same time.
You can try adding the parameter axis=1 to concat, because the outputs of df.iloc[0,:] and df.iloc[-1,:] are Series, and then transpose with T:
print(df.iloc[0, :])
a    1
b    a
Name: 0, dtype: object
print(df.iloc[-1, :])
a    4
b    d
Name: 3, dtype: object
print(pd.concat([df.iloc[0, :], df.iloc[-1, :]], axis=1))
   0  3
a  1  4
b  a  d
print(pd.concat([df.iloc[0, :], df.iloc[-1, :]], axis=1).T)
   a  b
0  1  a
3  4  d
Alternatively you can use take:
In [3]: df.take([0, -1])
Out[3]:
   a  b
0  1  a
3  4  d
Here is the same style pandas uses to display large datasets (the first and last five rows with an ellipsis row between):
x = df[:5]
y = pd.DataFrame([['...']*df.shape[1]], columns=df.columns, index=['...'])
z = df[-5:]
frame = [x, y, z]
result = pd.concat(frame)
print(result)
Output:
                     date  temp
0     1981-01-01 00:00:00  20.7
1     1981-01-02 00:00:00  17.9
2     1981-01-03 00:00:00  18.8
3     1981-01-04 00:00:00  14.6
4     1981-01-05 00:00:00  15.8
...                   ...   ...
3645  1990-12-27 00:00:00    14
3646  1990-12-28 00:00:00  13.6
3647  1990-12-29 00:00:00  13.5
3648  1990-12-30 00:00:00  15.7
3649  1990-12-31 00:00:00    13
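Tying the corner cases together, a minimal sketch of a helper that returns the first and last rows without duplicating anything on a single-row frame (the first_last name is my own, not a pandas API):

import pandas as pd

def first_last(df):
    # iloc[[0, -1]] would repeat the sole row of a one-row frame,
    # so collapse the positions first.
    positions = [0] if len(df) <= 1 else [0, len(df) - 1]
    return df.iloc[positions]

df = pd.DataFrame({'a': range(1, 5), 'b': list('abcd')})
print(first_last(df))          # rows 0 and 3
print(first_last(df.head(1)))  # a single row, not duplicated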
