I have a dataframe with two columns that contain JSON.
So for example,
df =
   A  B  C                             D
0  1  2  {b:1, c:2, d:{r:1, t:{y:0}}}  {v:9}
I want to flatten it entirely, so that every value in the JSON ends up in a separate column whose name is the full path. So here the value 0 will be in the column:
C_d_t_y
What is the best way to do this, without having to predefine the depth of the JSON or its fields?
If your dataframe contains only nested dictionaries (no lists), you can try:
import pandas as pd

def get_values(df):
    def _parse(val, current_path):
        # walk nested dicts, yielding ("full_path", leaf_value) pairs
        if isinstance(val, dict):
            for k, v in val.items():
                yield from _parse(v, current_path + [k])
        else:
            yield "_".join(map(str, current_path)), val

    rows = []
    for idx, row in df.iterrows():
        tmp = {}
        for i in row.index:
            tmp.update(dict(_parse(row[i], [i])))
        rows.append(tmp)
    return pd.DataFrame(rows, index=df.index)
print(get_values(df))
Prints:
   A  B  C_b  C_c  C_d_r  C_d_t_y  D_v
0  1  2    1    2      1        0    9
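If the JSON columns already hold Python dicts, a similar result can also be reached with pandas' built-in json_normalize; a minimal sketch (the C_/D_ prefixes and sep='_' are chosen here to reproduce the C_d_t_y naming):

import pandas as pd

df = pd.DataFrame({
    'A': [1],
    'B': [2],
    'C': [{'b': 1, 'c': 2, 'd': {'r': 1, 't': {'y': 0}}}],
    'D': [{'v': 9}],
})

# flatten each JSON column separately, joining nested keys with "_"
flat_c = pd.json_normalize(df['C'].tolist(), sep='_').add_prefix('C_').set_index(df.index)
flat_d = pd.json_normalize(df['D'].tolist(), sep='_').add_prefix('D_').set_index(df.index)

print(pd.concat([df[['A', 'B']], flat_c, flat_d], axis=1))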
I have a pandas dataframe with nested lists as values in a column as follows:
sample_df = pd.DataFrame({
    'single_proj_name': [['jsfk'], ['fhjk'], ['ERRW'], ['SJBAK']],
    'single_item_list': [['ABC_123'], ['DEF123'], ['FAS324'], ['HSJD123']],
    'single_id': [[1234], [5678], [91011], [121314]],
    'multi_proj_name': [['AAA', 'VVVV', 'SASD'], ['QEWWQ', 'SFA', 'JKKK', 'fhjk'], ['ERRW', 'TTTT'], ['SJBAK', 'YYYY']],
    'multi_item_list': [[['XYZAV', 'ADS23', 'ABC_123'], ['XYZAV', 'ADS23', 'ABC_123']], ['XYZAV', 'DEF123', 'ABC_123', 'SAJKF'], ['QWER12', 'FAS324'], [['JFAJKA', 'HSJD123'], ['JFAJKA', 'HSJD123']]],
    'multi_id': [[[2167, 2147, 29481], [2167, 2147, 29481]], [2313, 57567, 2321, 7898], [1123, 8775], [[5237, 43512], [5237, 43512]]]})
As you can see above, in some columns, the same list is repeated twice or more.
So, I would like to remove the duplicated list and only retain one copy of the list.
I was trying something like the below:
for i, (single, multi_item, multi_id) in enumerate(zip(sample_df['single_item_list'], sample_df['multi_item_list'], sample_df['multi_id'])):
    if (any(isinstance(i, list) for i in multi_item)) == False:
        for j, item_list in enumerate(multi_item):
            if single[0] in item_list:
                pos = item_list.index(single[0])
                sample_df.at[i, 'multi_item_list'] = [item_list]
                sample_df.at[i, 'multi_id'] = [multi_id[j]]
    else:
        print("under nested list")
        for j, item_list in enumerate(zip(multi_item, multi_id)):
            if single[0] in multi_item[j]:
                pos = multi_item[j].index(single[0])
                sample_df.at[i, 'multi_item_list'][j] = single[0]
                sample_df.at[i, 'multi_id'][j] = multi_id[j][pos]
            else:
                sample_df.at[i, 'multi_item_list'][j] = np.nan
                sample_df.at[i, 'multi_id'][j] = np.nan
But this assigns NaN to the whole column value. I expect to remove only the specific duplicated list (within the nested list) and keep a single copy.
I expect my output to be as below:
In the data it looks like removing duplicates is equivalent to keeping the first element in any list of lists while any standard lists are kept as they are. If this is true, then you can solve it as follows:
def get_first_list(x):
    if isinstance(x[0], list):
        return [x[0]]
    return x

for c in ['multi_item_list', 'multi_id']:
    sample_df[c] = sample_df[c].apply(get_first_list)
Result:
single_proj_name single_item_list single_id multi_proj_name multi_item_list multi_id
0 [jsfk] [ABC_123] [1234] [AAA, VVVV, SASD] [[XYZAV, ADS23, ABC_123]] [[2167, 2147, 29481]]
1 [fhjk] [DEF123] [5678] [QEWWQ, SFA, JKKK, fhjk] [XYZAV, DEF123, ABC_123, SAJKF] [2313, 57567, 2321, 7898]
2 [ERRW] [FAS324] [91011] [ERRW, TTTT] [QWER12, FAS324] [1123, 8775]
3 [SJBAK] [HSJD123] [121314] [SJBAK, YYYY] [[JFAJKA, HSJD123]] [[5237, 43512]]
To handle the case where there can be more than a single unique list, the get_first_list method can be adjusted to:
def get_first_list(x):
    if isinstance(x[0], list):
        new_x = []
        for i in x:
            if i not in new_x:
                new_x.append(i)
        return new_x
    return x
This will keep the order of the sublists while removing any sublist duplicates.
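For example, the adjusted helper keeps one copy of each sublist in its original order and leaves flat lists untouched:

print(get_first_list([['XYZAV', 'ADS23'], ['QWER12'], ['XYZAV', 'ADS23']]))
# [['XYZAV', 'ADS23'], ['QWER12']]
print(get_first_list(['QWER12', 'FAS324']))
# ['QWER12', 'FAS324']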
Briefly, with the np.unique function:
import numpy as np

cols = ['multi_item_list', 'multi_id']
sample_df[cols] = sample_df[cols].apply(
    lambda x: [np.unique(a, axis=0) if type(a[0]) == list else a for a in x.values])
In [382]: sample_df
Out[382]:
single_proj_name single_item_list single_id multi_proj_name \
0 [jsfk] [ABC_123] [1234] [AAA, VVVV, SASD]
1 [fhjk] [DEF123] [5678] [QEWWQ, SFA, JKKK, fhjk]
2 [ERRW] [FAS324] [91011] [ERRW, TTTT]
3 [SJBAK] [HSJD123] [121314] [SJBAK, YYYY]
multi_item_list multi_id
0 [[XYZAV, ADS23, ABC_123]] [[2167, 2147, 29481]]
1 [XYZAV, DEF123, ABC_123, SAJKF] [2313, 57567, 2321, 7898]
2 [QWER12, FAS324] [1123, 8775]
3 [[JFAJKA, HSJD123]] [[5237, 43512]]
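One caveat worth noting: np.unique returns its result in sorted order, so if a column ever contained several distinct sublists, their original order would not be preserved (unlike the apply-based approach above). For example:

import numpy as np

print(np.unique([[3, 4], [1, 2], [3, 4]], axis=0))
# [[1 2]
#  [3 4]]   <- duplicate row removed, but rows come back sorted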
I have a dataframe df
Col1 Col2
A B
C A
B D
E F
G D
G H
K J
and a Series id of IDs
ID
A
F
What I want is, for all letters in id, to select other letters that have any link with a max of 2 intermediates.
Let's make the example for A (way easier to understand with the example) :
There are 2 lines including A, linked to B and C, so direct links to A are [B, C]. (No matter if A is in Col1 or Col2)
A B
C A
But B is also linked to D, and D is linked to G :
B D
G D
So links to A are [B, C, D, G].
Even though G and H are linked, it would make more than 2 intermediates from A (A > B > D > G > H making B, D and G as intermediates), so I don't include H in A links lists.
G H
I'm looking for a way to search, for all IDs in id, the links list, and save it in id :
ID LinksList
A [B, C, D, G]
F [E]
I don't mind the type of LinksList (it can be a String) as long as I can get the info for a specific ID and work with it. I also don't mind the order of IDs in LinksList, as long as it's complete.
I already found a way to solve the problem, but it uses 3 for loops, so it takes a really long time.
(For k1 in ID, for k2 in range(0, 3), select the direct links for each element of LinksList + the starting element, and put them in LinksList if they're not already there.)
Can someone please help me do it only with Pandas?
Thanks a lot in advance!
==== EDIT: Here are the "3 loops", after Karl's comment: ====
i = 0
for k in id:
    # start with the direct links of k (whether k is in Col1 or Col2)
    linklist = list(df[df['Col1'] == k]['Col2']) + list(df[df['Col2'] == k]['Col1'])
    new = linklist.copy()
    intermediate_count = 1
    while len(new) > 0 and intermediate_count <= 2:
        nn = new.copy()
        new = []
        for n in nn:
            toadd = list(df[df['Col1'] == n]['Col2']) + list(df[df['Col2'] == n]['Col1'])
            toadd = list(set(toadd).difference(linklist))
            linklist = linklist + toadd
            new = new + toadd
        intermediate_count += 1
    if i == 0:
        d = {'Id': k, 'Linked': linklist}
        df_result = pd.DataFrame(data=d)
        i = 1
    else:
        d = {'Id': k, 'Linked': linklist}
        df_result = df_result.append(pd.DataFrame(data=d))
I would first append the reciprocal of the dataframe to be able to always go from Col1 to Col2. Then I would use merges to compute the possible results with 1 and 2 intermediate steps. Finally, I would aggregate all those values into sets. Code could be:
# append the symmetric (Col2 -> Col1) rows to the end of the dataframe
df2 = df.append(df.reindex(columns=reversed(df.columns)).rename(
    columns={df.columns[len(df.columns) - i]: col
             for i, col in enumerate(df.columns, 1)}), ignore_index=True
    ).drop_duplicates()

# add one step on Col3
df3 = df2.merge(df2, 'left', left_on='Col2', right_on='Col1',
                suffixes=('', '_')).drop(columns='Col1_').rename(
    columns={'Col2_': 'Col3'})

# add a second step on Col4
df4 = df3.merge(df2, 'left', left_on='Col3', right_on='Col1',
                suffixes=('', '_')).drop(columns='Col1_').rename(
    columns={'Col2_': 'Col4'})

# aggregate Col2 to Col4 into a set
df4['Links'] = df4.iloc[:, 1:].agg(set, axis=1)

# aggregate that new column grouped by Col1
result = df4.groupby('Col1')['Links'].agg(lambda x: set.union(*x)).reset_index()

# remove the initial value if present in Links
result['Links'] = result['Links'] - result['Col1'].apply(set)

# and display the result restricted to id
print(result[result['Col1'].isin(id)])
With the sample data, it gives as expected:
Col1 Links
0 A {D, C, B, G}
5 F {E}
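If the number of allowed intermediates ever needs to be a parameter, the two hand-written merges can be generalized with a loop; a sketch building on the df2 defined above (n_intermediates is a name introduced here for illustration):

n_intermediates = 2
dfn = df2.copy()
last_col = 'Col2'
for step in range(n_intermediates):
    new_col = 'Col%d' % (step + 3)
    dfn = (dfn.merge(df2, 'left', left_on=last_col, right_on='Col1', suffixes=('', '_'))
              .drop(columns='Col1_')
              .rename(columns={'Col2_': new_col}))
    last_col = new_col
# dfn now plays the role of df4 in the aggregation steps above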
We can use the NetworkX library:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
# Read in pandas dataframe using copy and paste
df = pd.read_clipboard()
# Create graph network from pandas dataframe
G = nx.from_pandas_edgelist(df, 'Col1', 'Col2')
# Create the id Series
id = pd.Series(['A', 'F'])
# Move the values into the index of the Series
id.index = id
# Use the `single_source_shortest_path` method in nx for each value in the id Series
id.apply(lambda x: list(nx.single_source_shortest_path(G, x, 3).keys())[1:])
Output:
A [B, C, D, G]
F [E]
dtype: object
Print graph representation:
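A minimal way to produce that plot with the matplotlib import above (a sketch; layout and colors are arbitrary):

nx.draw(G, with_labels=True, node_color='lightblue')
plt.show()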
I have a dataframe like below
df
           A          B          C
0          0          1  TRANSIT_1
1  TRANSIT_3       None       None
2          0  TRANSIT_5       None
And I want to change it to below:
Resulting DF
           A          B          C          D
0          0          1  TRANSIT_1  TRANSIT_1
1  TRANSIT_3       None       None  TRANSIT_3
2          0  TRANSIT_5       None  TRANSIT_5
So I tried to use str.contains and, once I receive the Series of True/False values, put it into an eval function to somehow get the table I want.
Code I tried:
series_index = pd.DataFrame()
series_index = df.columns.str.contains("^TRANSIT_", case=True, regex=True)
print(type(series_index))
series_index.index[series_index].tolist()
I thought to use the eval function to write it to a separate column, like:
df = eval(df[result]=the index)  # I don't know, but the eval function does the evaluation and puts it in a separate column
I couldn't find a simple one-liner, but this works:
idx = list(df1[df1.where(df1.applymap(lambda x: 'TRA' in x if isinstance(x, str) else False)).notnull()].stack().index)
a, b = [], []
for sublist in idx:
    a.append(sublist[0])
    b.append(sublist[1])
df1['ans'] = df1.lookup(a,b)
Output
           A          B          C        ans
0          0          1  TRANSIT_1  TRANSIT_1
1  TRANSIT_3       None       None  TRANSIT_3
2          0  TRANSIT_5       None  TRANSIT_5
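Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; on newer versions the same cell lookup can be done with NumPy-style indexing instead (a sketch, assuming a unique index):

rows = df1.index.get_indexer(a)
cols = df1.columns.get_indexer(b)
df1['ans'] = df1.to_numpy()[rows, cols]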
I want to group by key some rows in a RDD so I can perform more advanced operations with the rows within one group. Please note, I do not want to calculate merely some aggregate values. The rows are key-value pairs, where the key is a GUID and the value is a complex object.
As per the pyspark documentation, I first tried to implement this with combineByKey as it is supposed to be more performant than groupByKey. The list at the beginning is just for illustration, not my real data:
l = list(range(1000))
numbers = sc.parallelize(l)
rdd = numbers.map(lambda x: (x % 5, x))

def f_init(initial_value):
    return [initial_value]

def f_merge(current_merged, new_value):
    if current_merged is None:
        current_merged = []
    return current_merged.append(new_value)

def f_combine(merged1, merged2):
    if merged1 is None:
        merged1 = []
    if merged2 is None:
        merged2 = []
    return merged1 + merged2

combined_by_key = rdd.combineByKey(f_init, f_merge, f_combine)
c = combined_by_key.collectAsMap()
i = 0
for k, v in c.items():
    if v is None:
        print(i, k, 'value is None.')
    else:
        print(i, k, len(v))
    i += 1
The output of this is:
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
Which is not what I expected. The same logic but implemented with groupByKey returns a correct output:
grouped_by_key = rdd.groupByKey()
d = grouped_by_key.collectAsMap()
i = 0
for k, v in d.items():
    if v is None:
        print(i, k, 'value is None.')
    else:
        print(i, k, len(v))
    i += 1
Returns:
0 0 200
1 1 200
2 2 200
3 3 200
4 4 200
So unless I'm missing something, this is the case when groupByKey is preferred over reduceByKey or combineByKey (the topic of related discussion: Is groupByKey ever preferred over reduceByKey).
It is the case when understanding basic APIs is preferred. In particular, if you check the list.append docstring:
?list.append
## Docstring: L.append(object) -> None -- append object to end
## Type: method_descriptor
you'll see that, like the other mutating methods in the Python API, it by convention doesn't return the modified object. This means that f_merge always returns None and there is no accumulation whatsoever.
That being said, for most problems there are much more efficient solutions than groupByKey, but rewriting it with combineByKey (or aggregateByKey) like this is never one of them.
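For illustration only, the minimal fix on the combineByKey side would be to mutate the accumulator and then return it explicitly (keeping the original function names):

def f_merge(current_merged, new_value):
    current_merged.append(new_value)  # append mutates the list in place and returns None
    return current_merged             # so the accumulator must be returned explicitly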
I have a huge table file that looks like the following. In order to work on individual products (name), I tried to use pandas groupby, but it seems to put the whole table (~10G) in memory, which I cannot afford.
name index change
A Q QQ
A Q QQ
A Q QQ
B L LL
C Q QQ
C L LL
C LL LL
C Q QQ
C L LL
C LL LL
The name column is well sorted and I will only care about one name at a time. I hope to use the following criteria on column "change" to filter each name:
Check whether the number of "QQ" rows overwhelms the number of "LL" rows. Basically, if the number of rows containing "QQ" minus the number of rows containing "LL" is >= 2, then discard/ignore the "LL" rows for this name from now on. If "LL" overwhelms "QQ", then discard the rows with "QQ". (E.g. A has 3 QQ and 0 LL, and C has 4 LL and 2 QQ; both are fine.)
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
C L LL
C LL LL
C L LL
C LL LL
Comparing "change" to "index", if no change occurs (e.g. LL in both columns), the row is not valid. Further, for the valid changes, the remaining QQ or LL has to be continuous for >=3 times. Therefore C only has 2 valid changes, and it will be filtered out.
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
I wonder if there is a way to just work on the table name by name and release the memory after each name, please. (And without having to apply the two criteria step by step.) Any hint or suggestion will be appreciated!
Because the file is sorted by "name", you can read the file row-by-row:
def process_name(name, data, output_file):
    group_by = {}
    for index, change in data:
        if index not in group_by:
            group_by[index] = []
        group_by[index].append(change)

    # do the step 1 filter logic here

    # do the step 2 filter logic here
    for index in group_by:
        if index == group_by[index]:
            # Because there is at least one "no change" this
            # whole "name" can be thrown out, so return here.
            return

    for index in group_by:
        output_file.write("%s\t%s\t%s\n" % (name, index, group_by[index]))

current_name = None
current_data = []

input_file = open(input_filename, "r")
output_file = open(output_filename, "w")

header = input_file.readline()
for row in input_file:
    cols = row.strip().split("\t")
    name = cols[0]
    index = cols[1]
    change = cols[2]
    if name != current_name:
        if current_name is not None:
            process_name(current_name, current_data, output_file)
        current_name = name
        current_data = []
    current_data.append((index, change))

# process what's left in the buffer
if current_name is not None:
    process_name(current_name, current_data, output_file)

input_file.close()
output_file.close()
I don't totally understand the logic you've explained in #1, so I left that blank. I also feel like you probably want to do step #2 first as that will quickly rule out entire "name"s.
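For what it's worth, one possible reading of the step 1 rule (QQ vs. LL counts, assuming the same >= 2 margin in both directions) could be plugged in as a helper over the (index, change) tuples passed to process_name; this is a sketch of my interpretation, not part of the original logic:

def step1_filter(data):
    # data: list of (index, change) tuples for one name
    qq = sum(1 for _, change in data if change == "QQ")
    ll = sum(1 for _, change in data if change == "LL")
    if qq - ll >= 2:
        return [(i, c) for i, c in data if c != "LL"]  # QQ overwhelms: drop the LL rows
    if ll - qq >= 2:
        return [(i, c) for i, c in data if c != "QQ"]  # LL overwhelms: drop the QQ rows
    return data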
Since your file is sorted and you only seem to be operating on the sub-segments by name, perhaps just use Python's groupby and create a table for each name segment as you go:
from itertools import groupby
import pandas as pd

with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_data = {k: [] for k in header}
        for e in segment:
            for key, v in zip(header, e.split()):
                seg_data[key].append(v)
        seg_fram = pd.DataFrame.from_dict(seg_data)
        print(k)
        print(seg_fram)
        print()
Prints:
A
change index name
0 QQ Q A
1 QQ Q A
2 QQ Q A
B
change index name
0 LL L B
C
change index name
0 QQ Q C
1 LL L C
2 LL LL C
3 QQ Q C
4 LL L C
5 LL LL C
Then the largest piece of memory you will have will be dictated by the largest contiguous group and not the size of the file.
You can use 1/2 the memory of that method by appending to the data frame row by row instead of building the intermediate dict:
with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_fram = pd.DataFrame(columns=header)
        for idx, e in enumerate(segment):
            df = pd.DataFrame({k: v for k, v in zip(header, e.split())}, index=[idx])
            seg_fram = seg_fram.append(df)
(might be slower though...)
If that does not work, consider using a disk database.
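For the disk-database route, a minimal sketch with SQLite through pandas could look like this (the database file, table name, and whitespace separator are assumptions):

import sqlite3
import pandas as pd

con = sqlite3.connect("products.db")

# one-time load: stream the big file into an on-disk table in chunks
for chunk in pd.read_csv("/tmp/so.csv", sep=r"\s+", chunksize=100_000):
    chunk.to_sql("products", con, if_exists="append", index=False)

# then pull one name at a time; memory stays bounded by a single group
names = [n for (n,) in con.execute("SELECT DISTINCT name FROM products")]
for n in names:
    seg = pd.read_sql("SELECT * FROM products WHERE name = ?", con, params=(n,))
    # ...apply the two filtering criteria to seg here...

con.close()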