Object to pandas dataframe - python

I have a list of objects like in the test variable below:
import dataclasses
from typing import List

@dataclasses.dataclass
class A:
    a: float
    b: float
    c: float

@dataclasses.dataclass
class B:
    prop: str
    attr: List["A"]

test = [
    B("z", [A('a', 'b', 'c'), A('d', 'l', 's')]),
    B("a", [A('s', 'v', 'c')]),
]
And I want to transform it into a pandas df like this:
prop a b c
0 z a b c
0 z d l s
1 a s v c
I can do it in several steps, but it seems unnecessary and inefficient, as I'm going through the same data multiple times:
import pandas as pd

a = pd.DataFrame([obj.__dict__ for obj in test])
a
prop attr
0 z [A(a='a', b='b', c='c'), A(a='d', b='l', c='s')]
1 a [A(a='s', b='v', c='c')]
b = a.explode('attr')
b
prop attr
0 z A(a='a', b='b', c='c')
0 z A(a='d', b='l', c='s')
1 a A(a='s', b='v', c='c')
b[["a", "b", "c"]] = b.apply(lambda x: [x.attr.a, x.attr.b, x.attr.c], axis=1, result_type="expand")
b
prop attr a b c
0 z A(a='a', b='b', c='c') a b c
0 z A(a='d', b='l', c='s') d l s
1 a A(a='s', b='v', c='c') s v c
Can it be done a bit more efficiently?

Use a combination of dataclasses.asdict and pd.json_normalize:
In [59]: pd.json_normalize([dataclasses.asdict(k) for k in test], 'attr', ['prop'])
Out[59]:
a b c prop
0 a b c z
1 d l s z
2 s v c a
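Note that json_normalize appends the meta column (prop) last. If you need the columns in the same order as your expected output, a quick reindex fixes that:
df = pd.json_normalize([dataclasses.asdict(k) for k in test], 'attr', ['prop'])
df = df[['prop', 'a', 'b', 'c']]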

Another version:
df = pd.DataFrame({"prop": b.prop, **a.__dict__} for b in test for a in b.attr)
Result:
prop a b c
0 z a b c
1 z d l s
2 a s v c
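This builds the frame in a single pass over the nested data, which addresses the concern about going through the same data multiple times. If you prefer not to touch __dict__ directly, vars(a) is equivalent for plain dataclass instances:
df = pd.DataFrame({"prop": b.prop, **vars(a)} for b in test for a in b.attr)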

Related

Splitting pandas column

I have a string column that I wish to split into three columns depending on the string. The column looks like this:
full_string
x a b c
d e
m n o
y m n
y d e f
d e f
x and y are prefixes. I want to convert this column into three columns:
prefix_string first_string last_string
x a c
d e
m o
y m n
y d f
d f
I have this code:
df['first_string'] = df[df['full_string'].str.split().str.len() == 2]['full_string'].str.split().str[0]
df['first_string'] = df[df['full_string'].str.split().str.len() > 2]['full_string'].str.split().str[1]
df['last_string'] = df['full_string'].str.split().str[-1]
prefix_string = ['x', 'y']
df['prefix_string'] = df[df['full_string'].str.split().str[0].isin(prefix_string)]['full_string'].str.split().str[0]
This code isn't working correctly for first_string. Is there a way to extract the first string irrespective of prefix_string and the string length?
Try with numpy.where and pandas.Series.str.split:
import numpy as np
prefix_str = ["x", "y"]
res = df["full_string"].str.split(" ", expand=True).ffill(axis=1)
res["last_string"] = res.iloc[:, -1]
res["prefix_string"] = np.where(res[0].isin(prefix_str), res[0], "")
res["first_string"] = np.where(res["prefix_string"].ne(""), res[1], res[0])
res = res[["prefix_string", "first_string", "last_string"]]
Outputs:
prefix_string first_string last_string
0 x a c
1 d e
2 m o
3 y m n
4 y d f
5 d f
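ffill(axis=1) forward-fills the shorter rows so that iloc[:, -1] always picks up the last token. If you want the three new columns on the original DataFrame rather than in res, you can assign them back, since res shares df's index:
df[["prefix_string", "first_string", "last_string"]] = res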
Instead of these lines in your above code:
df['first_string'] = df[df['full_string'].str.split().str.len() == 2]['full_string'].str.split().str[0]
df['first_string'] = df[df['full_string'].str.split().str.len() > 2]['full_string'].str.split().str[1]
make use of the split(), contains() and fillna() methods:
split = df['full_string'].str.split(expand=True)
df['first_string'] = split.loc[~split[0].str.contains('x|y'), 0]
df['first_string'] = df['first_string'].fillna(split[1])
Output of df:
full_string first_string last_string prefix_string
0 x a b c a c x
1 d e d e NaN
2 m n o m o NaN
3 y m n m n y
4 y d e f d f y
5 d e f d f NaN
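One caveat with str.contains('x|y'): it is a regex substring match, so any first token that merely contains an x or y (say 'max') would be treated as a prefix. isin(['x', 'y']) compares whole tokens, which may be safer:
split = df['full_string'].str.split(expand=True)
df['first_string'] = split.loc[~split[0].isin(['x', 'y']), 0]
df['first_string'] = df['first_string'].fillna(split[1])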

How to repeat certain rows of a dataframe?

I have a dataframe like this
import pandas as pd
df1 = pd.DataFrame({
    'key': list('AAABBC'),
    'prop1': list('xyzuuy'),
    'prop2': list('mnbnbb')
})
key prop1 prop2
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
and a dictionary like this (user input):
d = {
    'A': 2,
    'B': 1,
    'C': 3,
}
The keys of d refer to entries in the column key of df1; the values indicate how often the rows of df1 that belong to the respective key should be present: 1 means that nothing has to be done, 2 means all lines should be copied once, 3 means they should be copied twice.
For the example above, the expected output looks as follows:
key prop1 prop2
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
6 A x m # <-- copied, copy 1
7 A y n # <-- copied, copy 1
8 A z b # <-- copied, copy 1
9 C y b # <-- copied, copy 1
10 C y b # <-- copied, copy 2
So, the rows that belong to A have been copied once and added to df1, nothing had to be done about the rows that belong to B, and the rows that belong to C have been copied twice and were also added to df1.
I currently implement this as follows:
dfs_to_add = []
for el, val in d.items():
    if val > 1:
        _temp_df = pd.concat(
            [df1[df1['key'] == el]] * (val - 1)
        )
        dfs_to_add.append(_temp_df)
df_to_add = pd.concat(dfs_to_add)
df_final = pd.concat([df1, df_to_add]).reset_index(drop=True)
which gives me the desired output.
The code is rather ugly; does anyone see a more straightforward option to get to the same output?
The order is important, so in the case of A, I would need
0 A x m
1 A y n
2 A z b
0 A x m
1 A y n
2 A z b
and not
0 A x m
0 A x m
1 A y n
1 A y n
2 A z b
2 A z b
We can use concat + groupby:
df = pd.concat([pd.concat([y] * d.get(x)) for x, y in df1.groupby('key')])
key prop1 prop2
0 A x m
1 A y n
2 A z b
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
5 C y b
5 C y b
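One caveat with this version: d.get(x) returns None for any key missing from d, and [y] * None raises a TypeError; a default of 1 keeps unmapped keys as they are:
df = pd.concat([pd.concat([y] * d.get(x, 1)) for x, y in df1.groupby('key')])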
One way using Index.repeat with loc[] and Series.map:
m = df1.set_index('key',append=True)
out = m.loc[m.index.repeat(df1['key'].map(d))].reset_index('key')
print(out)
key prop1 prop2
0 A x m
0 A x m
1 A y n
1 A y n
2 A z b
2 A z b
3 B u n
4 B u b
5 C y b
5 C y b
5 C y b
You can try repeat:
df1.loc[df1.index.repeat(df1['key'].map(d))]
Output:
key prop1 prop2
0 A x m
0 A x m
1 A y n
1 A y n
2 A z b
2 A z b
3 B u n
4 B u b
5 C y b
5 C y b
5 C y b
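Since loc keeps the original index labels, the repeated rows share their label with the source row. If you want a fresh RangeIndex as in the expected output, add reset_index:
df1.loc[df1.index.repeat(df1['key'].map(d))].reset_index(drop=True)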
If order is not important, use the other solutions.
If order is important, get the indices of the repeated values, repeat with loc and concat to the original:
idx = [x for k, v in d.items() for x in df1.index[df1['key'] == k].repeat(v - 1)]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df1, df1.loc[idx]], ignore_index=True)
print (df)
key prop1 prop2
0 A x m
1 A y n
2 A z b
3 B u n
4 B u b
5 C y b
6 A x m
7 A y n
8 A z b
9 C y b
10 C y b
Using DataFrame.merge and np.repeat:
import numpy as np

df = df1.merge(
    pd.Series(np.repeat(list(d.keys()), list(d.values())), name='key'),
    on='key')
Result:
# print(df)
key prop1 prop2
0 A x m
1 A x m
2 A y n
3 A y n
4 A z b
5 A z b
6 B u n
7 B u b
8 C y b
9 C y b
10 C y b
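Note that merge defaults to an inner join, so any key of df1 that is missing from d is dropped from the result entirely. If that can happen, building the repeat Series with a default of 1 avoids losing rows (a small sketch):
keys = df1['key'].unique()
rep = pd.Series(np.repeat(keys, [d.get(k, 1) for k in keys]), name='key')
df = df1.merge(rep, on='key')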

Filtering pandas Dataframe on Parent Child Condition

I have a DataFrame df with 3 columns:
_id parent_id type
A B Subcategory_level
B null Main_Level
D A Product_Level
M N Product_Level
X Y Subcategory_Level
Z X Subcategory_Level
L Z Product_Level
What I want as my output is:
_id parent_id type
D A product_level
M N product_level
L X product_level
What I tried: drop all of the rows having type equal to main_level. Then:
df1 = df
df1.rename(columns={'_id': 'parent_id', 'parent_id': '_id'},
           index=str, inplace=True)
Then a natural join of df1 with df:
final_df=pd.merge(df,df1,on='parent_id', how='inner')
But the problem with this natural join is that it will not work if there is more than one level in the hierarchy. E.g. the relation between X and L has 2 levels of hierarchy, and in that case it does not work.
Is that what you're saying?
df[df.type == 'product_level']
_id parent_id type
D A product_level
M N product_level
L X product_level
Maybe I don't understand what you mean; I thought it was this:
In [2]: df = pd.DataFrame({"a":[1,2,3,4], "b":["x","t","s","g"], "x":["l1", "l3", "l1", "l2"]})
In [3]: df
Out[3]:
a b x
0 1 x l1
1 2 t l3
2 3 s l1
3 4 g l2
In [4]: df[df.x=="l1"]
Out[4]:
a b x
0 1 x l1
2 3 s l1
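As an aside on the answer above: the sample data mixes cases ('Product_Level' vs. 'product_level'), so an exact comparison could come up empty on the real frame; a case-insensitive filter is safer:
df[df.type.str.lower() == 'product_level']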

Clustering geodata into same size group with K-means in Python

I have the following sample geo-data which has 100 rows, and I want to cluster those POIs into 10 groups with 10 points in each group; also, if possible, the data in each group should come from the same area.
id areas lng lat
1010094160 A 116.31967 40.03229
1010737675 A 116.28941 40.03968
1010724217 A 116.32256 40.048
1010122181 A 116.28683 40.09652
1010732739 A 116.33482 40.06456
1010730289 A 116.3724 40.04066
1010737817 A 116.24174 40.074
1010124109 A 116.2558 40.08371
1010732695 B 116.31591 40.07096
1010112361 B 116.33331 39.96539
1010042095 B 116.31283 39.98804
1010097579 B 116.37637 39.98865
1010110203 B 116.41351 40.00851
1010085120 B 116.41364 39.98069
1010310183 B 116.42757 40.03738
1010087029 B 116.38947 39.97715
1010737155 B 116.38391 39.9849
1010729305 B 116.37803 40.04512
1010085100 B 116.37679 39.98838
1010750159 B 116.32162 39.98518
1010061742 B 116.31618 39.99087
1010091848 B 116.37617 39.97739
1010104343 C 116.3295 39.98156
1010091704 C 116.37236 39.9943
1010086652 C 116.36102 39.92978
1010030017 C 116.39017 39.99287
1010091851 C 116.35854 40.0063
1010705229 C 116.39114 39.97511
1010107321 C 116.42535 39.95417
1010130423 C 116.31651 40.04164
1010126133 C 116.29051 40.05081
1010177543 C 116.41114 39.99635
1010123271 C 116.35923 40.02031
1010315589 C 116.33906 39.99895
Here is the expected result:
id areas lng lat clusterNumber
1010094160 A 116.31967 40.03229 0
1010737675 A 116.28941 40.03968 0
1010724217 A 116.32256 40.048 0
1010122181 A 116.28683 40.09652 0
1010732739 A 116.33482 40.06456 0
1010730289 A 116.3724 40.04066 0
1010737817 A 116.24174 40.074 0
1010124109 A 116.2558 40.08371 0
1010732695 B 116.31591 40.07096 0
1010112361 B 116.33331 39.96539 1
1010042095 B 116.31283 39.98804 1
1010097579 B 116.37637 39.98865 1
1010110203 B 116.41351 40.00851 1
1010085120 B 116.41364 39.98069 1
1010310183 B 116.42757 40.03738 1
1010087029 B 116.38947 39.97715 1
1010737155 B 116.38391 39.9849 1
1010729305 B 116.37803 40.04512 1
1010085100 B 116.37679 39.98838 1
1010750159 B 116.32162 39.98518 2
1010061742 B 116.31618 39.99087 2
1010091848 B 116.37617 39.97739 2
1010104343 C 116.3295 39.98156 2
1010091704 C 116.37236 39.9943 2
1010086652 C 116.36102 39.92978 2
1010030017 C 116.39017 39.99287 2
1010091851 C 116.35854 40.0063 2
1010705229 C 116.39114 39.97511 2
1010107321 C 116.42535 39.95417 2
1010130423 C 116.31651 40.04164 3
1010126133 C 116.29051 40.05081 3
I have tried K-means, but I can't keep each group the same size. Is there any other, better method I can use in Python? Please share your ideas and hints. Thanks.
Here is what I have tried:
X = []
for row in result:
    X.append([float(row['lng']), float(row['lat'])])
X = np.array(X)
n_clusters = 100
cls = KMeans(n_clusters, random_state=0).fit(X)
#cls = EqualGroupsKMeans(n_clusters, random_state=0).fit(X)
#km1 = KMeans(n_clusters=6, n_init=25, max_iter=600, random_state=0)
cls.labels_
markers = ['^', 'x', 'o', '*', '+', '+']
colors = ['b', 'c', 'g', 'k', 'm', 'r']
for i in range(n_clusters):
    members = cls.labels_ == i
    print(len(X[members, 0]))
    #plt.scatter(X[members,0], X[members,1], s=6, marker=markers[i], c=colors[i], alpha=0.5)
    plt.scatter(X[members, 0], X[members, 1], s=6, marker="^", c=colors[i % 6], alpha=0.5)
plt.title(' ')
plt.show()
Here is a reference I found for Same-Size-K-Means on GitHub:
https://github.com/ndanielsen/Same-Size-K-Means
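If you would rather avoid an extra dependency, a simple way to force equal group sizes is to take the centroids from ordinary KMeans and then assign points to centroids greedily, from the smallest point-centroid distance up, capping each cluster at len(X) // n_clusters points. This is only a sketch of that idea (not the library above), and the clusters will be somewhat worse than unconstrained KMeans:
import numpy as np
from sklearn.cluster import KMeans

def equal_size_clusters(X, n_clusters):
    # centroids from ordinary KMeans
    centers = KMeans(n_clusters, random_state=0).fit(X).cluster_centers_
    size = len(X) // n_clusters  # assumes len(X) is divisible by n_clusters
    # distance of every point to every centroid, shape (n_points, n_clusters)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = np.full(len(X), -1)
    counts = np.zeros(n_clusters, dtype=int)
    # walk point/centroid pairs from the closest pair upwards,
    # skipping points already placed and clusters already full
    pairs = np.dstack(np.unravel_index(np.argsort(dists, axis=None), dists.shape))[0]
    for i, j in pairs:
        if labels[i] == -1 and counts[j] < size:
            labels[i] = j
            counts[j] += 1
    return labels

labels = equal_size_clusters(X, 10)  # X built as in the code above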

Complex function for getting combinations of one column with other

I have a table in pandas df
id_x id_y
a b
b c
a c
d a
x a
m b
c z
a k
b q
d w
a w
q v
How to read this table: the combinations for a are a-b, a-c, a-k, a-w; similarly for b (b-c, b-q) and so on.
I want to write a function which takes an id_x from the df, def test_func(id), and checks whether that id occurs at least 3 times, which may be done with df['id_x'].value_counts.
For example:
def test_func(id):
    if id_count >= 3:
        print('yes')
        ddf = df[df['id_x'] == id]
        ddf.to_csv(id + ".csv")
    else:
        print('no')
        while id_count < 3:
            # do something (I've explained below what I have to do when count < 3)
Say for b the occurrence is only 2 (i.e. b-c and b-q), which is less than 3. So in such a case, look whether 'c' (from id_y) has any combinations. c has 1 combination (c-z) and similarly q has 1 combination (q-v); thus b should be linked with z and v:
id_x id_y
b c
b q
b z
b v
and store it in ddf2 like we stored above for the count >= 3 case.
Also, for a particular id, I would like the csv saved with the name of that id.
I hope I explained my question correctly; I am very new to Python and don't know how to write functions, but this was my logic.
Can anyone help me with the implementation part?
Thanks in advance.
Edit: solution redesigned according to the comments.
import pandas as pd

def direct_related(df, values, column_names=('x', 'y')):
    rels = set()
    for value in values:
        # items() replaces iteritems(), which was removed in pandas 2.0
        for i, v in df[df[column_names[0]] == value][column_names[1]].items():
            rels.add(v)
    return rels

def indirect_related(df, values, recursion=1, column_names=('x', 'y')):
    rels = direct_related(df, values, column_names)
    for i in range(recursion):
        rels = rels.union(direct_related(df, rels, column_names))
    return rels

def related(df, value, recursion=1, column_names=('x', 'y')):
    rels = indirect_related(df, [value], recursion, column_names)
    return pd.DataFrame(
        {
            column_names[0]: value,
            column_names[1]: list(rels)
        }
    )

def min_related(df, value, min_appearances=3, max_recursion=10, column_names=('x', 'y')):
    for i in range(max_recursion + 1):
        if len(indirect_related(df, [value], i, column_names)) >= min_appearances:
            return related(df, value, i, column_names)
    return None

df = pd.DataFrame(
    {
        'x': ['a', 'b', 'a', 'd', 'x', 'm', 'c', 'a', 'b', 'd', 'a', 'q'],
        'y': ['b', 'c', 'c', 'a', 'a', 'b', 'z', 'k', 'q', 'w', 'w', 'v']
    }
)
print(min_related(df, 'b', 3))
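To get the per-id CSV files the question asks for, you can loop over the unique ids and save every result that reaches the threshold (a short sketch on top of the functions above):
for value in df['x'].unique():
    res = min_related(df, value, 3)
    if res is not None:
        res.to_csv(value + '.csv', index=False)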
First filter the DataFrame by group length (keeping groups with fewer than 3 rows):
a = df.groupby('id_x').filter(lambda x: len(x) < 3)
print (a)
id_x id_y
1 b c
3 d a
4 x a
5 m b
6 c z
8 b q
9 d w
11 q v
Then filter rows where id_x is b and rename columns:
a1 = a.query("id_x == 'b'").rename(columns={'id_y':'id'})
print (a1)
id_x id
1 b c
8 b q
Also filter rows where id_y is NOT b and rename columns:
a2 = a.query("id_y != 'b'").rename(columns={'id_x':'id'})
print (a2)
id id_y
1 b c
3 d a
4 x a
6 c z
8 b q
9 d w
11 q v
Then merge by column id:
b = pd.merge(a1, a2, on='id').drop('id', axis=1)
print (b)
id_x id_y
0 b z
1 b v
Last, concat the rows filtered for b with the new DataFrame b:
c = pd.concat([a.query("id_x == 'b'"), b])
print (c)
id_x id_y
1 b c
8 b q
0 b z
1 b v
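For reuse, the steps above can be wrapped into one helper, with the id 'b' as a parameter (a sketch assuming the original id_x/id_y column names):
def expand_low_count(df, key, min_count=3):
    # groups of id_x that appear fewer than min_count times
    a = df.groupby('id_x').filter(lambda x: len(x) < min_count)
    a1 = a.query("id_x == @key").rename(columns={'id_y': 'id'})
    a2 = a.query("id_y != @key").rename(columns={'id_x': 'id'})
    b = pd.merge(a1, a2, on='id').drop('id', axis=1)
    return pd.concat([a.query("id_x == @key"), b])

print(expand_low_count(df, 'b'))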
