I have the following input:
my_list = ["x d1","y d1","z d2","t d2"]
And I would like to transform it into:
Expected_result = ["d1(x,y)","d2(z,t)"]
I had to use brute force, and call pandas to my rescue, since I didn't find any way to do it in plain/vanilla Python. Do you have any other way to solve this?
import pandas as pd

my_list = ["x d1","y d1","z d2","t d2"]
df = pd.DataFrame(my_list, columns=["col1"])
df2 = df["col1"].str.split(" ", expand=True)
df2.columns = ["col1", "col2"]
grp = df2.groupby(["col2"])

result = []
for grp_name, data in grp:
    res = grp_name + "(" + ",".join(list(data["col1"])) + ")"
    result.append(res)
print(result)
The code below starts with an empty dictionary.
It then iterates over each item in your list and uses the split() method to split the item into a key and a value.
It then uses the setdefault() method to add them to the dictionary, with the value (d1, d2, ...) as the dictionary key: if that key already exists, the item's first word is appended to its existing list; if it does not, a new key is created with the first word as the initial element of a new list.
Finally, the list comprehension iterates over the items in the dictionary and builds a string for each key-value pair, using the join() method to concatenate the words in each list into a single string.
result = {}
for item in my_list:
    key, value = item.split()
    result.setdefault(value, []).append(key)

output = [f"{k}({', '.join(v)})" for k, v in result.items()]
print(output)
['d1(x, y)', 'd2(z, t)']
If your values are already sorted by key (d1, d2), you can use itertools.groupby:
from itertools import groupby

out = [f"{k}({','.join(x[0] for x in g)})"
       for k, g in groupby(map(str.split, my_list), lambda x: x[1])]
Output:
['d1(x,y)', 'd2(z,t)']
Otherwise you should use a dictionary as shown by @Jamiu.
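If you'd rather keep itertools.groupby on input that isn't already grouped, a minimal sketch (my addition, not part of the original answer) is to sort by the label first; sorted() is stable, so the within-group order is preserved:

from itertools import groupby

# hypothetical unsorted input, for illustration
my_list = ["z d2", "x d1", "t d2", "y d1"]

pairs = sorted(map(str.split, my_list), key=lambda x: x[1])  # sort by the dN label
out = [f"{k}({','.join(x[0] for x in g)})"
       for k, g in groupby(pairs, lambda x: x[1])]
print(out)  # ['d1(x,y)', 'd2(z,t)']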
A variant of your pandas solution:
out = (df['col1'].str.split(n=1, expand=True)
       .groupby(1)[0]
       .apply(lambda g: f"{g.name}({','.join(g)})")
       .tolist()
      )
my_list = ["x d1","y d1","z d2","t d2"]

res = []
for item in my_list:
    a, b, *_ = item.split()
    if len(res) and b in res[-1]:
        res[-1] = res[-1].replace(')', f',{a})')
    else:
        res.append(f'{b}({a})')

print(res)
['d1(x,y)', 'd2(z,t)']
Let N be the number that follows d. This code works for any number of elements within each dN group, as long as the groups appear in order, that is, d1 comes before d2, which comes before d3, and so on. It works with any value of N, and any value may appear before the label, as long as each item keeps the "val_in_dN dN" format, in that order.
If you need something that works even when the dN are not in sequence, just say the word, but it will cost a little more.
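For reference, one order-independent variant in the same spirit (my sketch, not the answer's code): search all accumulated strings for the label instead of only the last one.

my_list = ["x d1", "z d2", "y d1", "t d2"]  # labels out of sequence

res = []
for item in my_list:
    a, b = item.split()
    for i, s in enumerate(res):
        if s.startswith(f'{b}('):          # this label already has a group
            res[i] = s.replace(')', f',{a})')
            break
    else:                                  # no break: first time we see this label
        res.append(f'{b}({a})')

print(res)  # ['d1(x,y)', 'd2(z,t)']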
Another possible solution, based on pandas and numpy:
import numpy as np
import pandas as pd

(pd.DataFrame(np.array([str.split(x, ' ') for x in my_list]), columns=['b', 'a'])
 .groupby('a')['b'].apply(lambda x: f'({x.values[0]}, {x.values[1]})')
 .reset_index().sum(axis=1).tolist())
Output:
['d1(x, y)', 'd2(z, t)']
EDIT
The OP, @ShckTchamna, would like to see the above solution modified in order to be more general: this edit provides a solution that works with the example the OP gives in his comment below.
my_list = ["x d1","y d1","z d2","t d2","kk d2","m d3", "n d3", "s d4"]
(pd.DataFrame(np.array([str.split(x, ' ') for x in my_list]), columns=['b', 'a'])
 .groupby('a')['b'].apply(lambda x: f'({",".join(x.values)})')
 .reset_index().sum(axis=1).tolist())
Output:
['d1(x,y)', 'd2(z,t,kk)', 'd3(m,n)', 'd4(s)']
import pandas as pd

df = pd.DataFrame(data=[e.split(' ') for e in ["x d1","y d1","z d2","t d2"]])
r = (df.groupby(1)
       .apply(lambda r: "{0}({1},{2})".format(r.iloc[0,1], r.iloc[0,0], r.iloc[1,0]))
       .reset_index()
       .rename({1: "points", 0: "coordinates"}, axis=1)
    )
print(r.coordinates.tolist())
# ['d1(x,y)', 'd2(z,t)']
print(r)
# points coordinates
# 0 d1 d1(x,y)
# 1 d2 d2(z,t)
As a replacement for my previous answer (which also works):
import itertools as it

my_list = [e.split(' ') for e in ["x d1","y d1","z d2","t d2"]]
r = []
for key, group in it.groupby(my_list, lambda x: x[1]):
    l = [e[0] for e in list(group)]
    r.append("{0}({1},{2})".format(key, l[0], l[1]))
print(r)
Output:
['d1(x,y)', 'd2(z,t)']
I have a dataframe as follows:
df
KEY NAME ID_LOCATION _GEOM
0 61196 name1 [(u'-88.121429', u'41.887726')] [[[lon00,lat00],[lon01, lat01]]]
1 61197 name2 [(u'-75.161934', u'38.725163')] [[[lon10,lat10], [lon11,lat11],...]]
2 61199 name3 [(u'-88.121429', u'41.887726'), (-77.681931, 37.548851)] [[[lon20, lat20],[lon21, lat21]]]
where ID_LOCATION is a list of tuples. How can I group by ID_LOCATION so that, if two rows share a matching (lon, lat) pair, those rows are merged and the other columns' values are joined with commas?
expected_output_df
KEY NAME ID_LOCATION _GEOM
0 61196,61199 name1,name3 [(u'-85.121429', u'40.887726'), (-77.681931, 37.548851)] [[[lon00, lat00],[lon01, lat01],[lon20, lat20],[lon21, lat21]]]
1 61197 name2 [(u'-72.161934', u'35.725163')] [[[lon10,lat10], [lon11,lat11],...]]
I tried the following without success; it gives me the error "unhashable type: 'list'":
def f(x):
    return pd.Series(dict(KEY='{%s}' % ', '.join(x['KEY']),
                          NAME='{%s}' % ', '.join(x['NAME']),
                          ID_LOCATION='{%s}' % ', '.join(x['ID_LOCATION']),
                          _GEOM='{%s}' % ', '.join(x['_GEOM'])))

df = df.groupby('ID_LOCATION').apply(f)
I think this should work.
First convert things into lists of the same type (so that sum will append things together).
import itertools

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[['61196'], ['name1'], [('-88.121429', '41.887726')]],
     [['61197'], ['name2'], [('-75.161934', '38.725163')]],
     [['61199'], ['name3'], [('-88.121429', '41.887726'), ('-77.681931', '37.548851')]]],
    columns=['KEY', 'NAME', 'id_loc']
)
Then get pairwise combinations of rows (for id_loc), i.e., pairs of rows to merge together.
# Loop through all pairwise combinations of rows (we need the index, so loop over range() instead of raw values).
to_merge = []  # list of index-tuples: rows to merge together.
for i, j in itertools.combinations(range(len(df['id_loc'].values)), 2):
    a = df['id_loc'].values[i]
    b = df['id_loc'].values[j]
    # Check for shared elements.
    if not set(a).isdisjoint(b):
        # Shared elements found.
        to_merge.append([i, j])
Now handle the case where there are 3 or more rows, i.e., to_merge = [[1, 2], [2, 3]] should become to_merge = [[1, 2, 3]].
def find_intersection(m_list):
    for i, v in enumerate(m_list):
        for j, k in enumerate(m_list[i+1:], i+1):
            if v & k:
                m_list[i] = v.union(m_list.pop(j))
                return find_intersection(m_list)
    return m_list

to_merge = [set(i) for i in to_merge if i]
to_merge = find_intersection(to_merge)
to_merge = [list(x) for x in to_merge]
(found from this answer)
Go through and sum all the rows that need to be merged (and drop pre-merge rows)
for idx_list in to_merge:
    df.iloc[idx_list[0], :] = df.iloc[idx_list, :].sum()
    df.iloc[idx_list[1:], :] = np.nan

df = df.dropna()
df['id_loc'] = df['id_loc'].apply(lambda x: list(set(x)))  # shared coords would be duped.
print(df)
Antoine Zambelli's answer is very good; as an exercise, and in the hope it helps anyway, I want to share my personal approach to the subject. It's not fully tested, but it should work.
# function to merge elements
def merge_elements(ensemble, column):
    upper_list = []
    for index in ensemble:
        element_list = []
        for item in index:
            if not isinstance(df.loc[item, column], list):
                if df.loc[item, column] not in element_list:
                    element_list.append(df.loc[item, column])
            else:
                for obj in df.loc[item, column]:
                    if obj not in element_list:
                        element_list.append(obj)
        upper_list.append([element_list, index])
    return upper_list

# put results in the dataframe
def put_in_df(df, piped, column):
    for elem in piped:
        for i in range(len(elem[1])):
            if column == "NAME" or column == "_GEOM":
                df.loc[elem[1][i], column] = str(elem[0]).replace("'", "")
            else:
                df.loc[elem[1][i], column] = str(elem[0])

# get the list from the df
list_of_locations = df.ID_LOCATION.tolist()

# get the list of rows that need to be merged (no itertools needed)
# the dictionary I used here is overkill; I had no actual need for it, so a plain list would suit perfectly
rows = {}
for i, item in enumerate(list_of_locations):
    if isinstance(item, list):
        for j in range(len(item)):
            if item[j] in rows:
                rows[item[j]] = [rows[item[j]], i]
            else:
                rows[item[j]] = i
    else:
        if item in rows:
            rows[item] = [rows[item], i]
        else:
            rows[item] = i

# as I said, there was no need for a dictionary, so this step could be simplified
ensemble = []
for item in rows.values():
    if isinstance(item, list):
        ensemble.append(item)

# conversion to tuple is optional
ensemble = tuple(ensemble)

# merge the lists of tuples according to the indexes retrieved
put_in_df(df, merge_elements(ensemble, "ID_LOCATION"), "ID_LOCATION")
put_in_df(df, merge_elements(ensemble, "NAME"), "NAME")
put_in_df(df, merge_elements(ensemble, "KEY"), "KEY")
put_in_df(df, merge_elements(ensemble, "_GEOM"), "_GEOM")

# special thanks to: https://stackoverflow.com/questions/43855462/pandas-drop-duplicates-method-not-working?rq=1
df = df.iloc[df.astype(str).drop_duplicates().index]
As I also noted in the comments, thanks to Pandas drop_duplicates method not working for the trick of dropping duplicates even in the presence of lists.
Is there a way to use the elements of a string-list first as strings and then as int?
l = ['A','B','C']
for n in l:
    # use n as a string (do some operations)
    # convert n to an int (element of NG)
    # use n as an int
I tried to play around with range/len but I didn't come to a solution.
Edit2:
This is what I have:
import pandas as pd
import matplotlib.pyplot as plt
NG = ['A','B','C']
l = [1,2,3,4,5,6]
b = [6,5,4,3,2,1]
for n in NG:
    print(n)
    dflist = []
    df = pd.DataFrame(l)
    dflist.append(df)
    df2 = pd.DataFrame(b)
    dflist.append(df2)
    df = pd.concat(dflist, axis=1)
    df.plot()
The output is 3 separate figures, one per iteration (images omitted here). But I want them all in one figure:
import pandas as pd
import matplotlib.pyplot as plt
NG = ['A','B','C']
l = [1,2,3,4,5,6]
b = [6,5,4,3,2,1]
for n in NG:
    print(n)
    dflist = []
    df = pd.DataFrame(l)
    dflist.append(df)
    df2 = pd.DataFrame(b)
    dflist.append(df2)
    df = pd.concat(dflist, axis=1)
    ax = plt.subplot(6, 2, n + 1)
    df.plot(ax=ax)
This code works, but only if the list NG is made of integers [1,2,3]. But I have strings, and I need them in the loop.
How to access both the element of a list and its index?
That is the real question as I understood mainly from this comment. And it is a pretty common and straightforward piece of code:
NG = ['A', 'B', 'C']
for i in range(len(NG)):
    print(i)
    print(NG[i])
Here is my 2 cents:
>>> for n in l:
...     print ord(n)
...     print n
...
65
A
66
B
67
C
To convert back to char
>>> chr(65)
'A'
I think "integer" here means the ASCII value of the character, so you can use the ASCII value to play with the characters again.
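For completeness, a small round trip between a character and its ASCII code (standard ord()/chr() behavior; my addition):

>>> ord('A')
65
>>> chr(ord('A') + 1)
'B'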
My solution is to type-cast your variables like this, and use the ord() function to get the ASCII values:
l = ['A','B','C']
for i in range(len(l)):
    print("string value is ", l[i])
    # now for integer values
    if not isinstance(l[i], int):
        print("ascii value of this char is ", ord(l[i]))
    else:
        print("already int type, go on..")
This is because there is no inherent int value of a character; the int value of a character generally refers to its ASCII value (or possibly some other encoding).
Use enumerate to iterate both on the indices and the letters.
NG = ['A','B','C']
for i, n in enumerate(NG, 1):
    print(i, n)
Will output:
(1, 'A')
(2, 'B')
(3, 'C')
In your case, because you don't need the letters at all in your loop, you can use the underscore _ to tell future readers what the code does: it uses the length of NG only for the indices.
for i, _ in enumerate(NG, 1):
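Applied to the plotting loop from the question, that could look like the sketch below (my adaptation; the 6x2 subplot grid is kept from the question's code):

import pandas as pd
import matplotlib.pyplot as plt

NG = ['A','B','C']
l = [1,2,3,4,5,6]
b = [6,5,4,3,2,1]

for i, n in enumerate(NG, 1):
    print(n)  # n is still the string label
    df = pd.concat([pd.DataFrame(l), pd.DataFrame(b)], axis=1)
    ax = plt.subplot(6, 2, i)  # i is the integer index the subplot call needs
    df.plot(ax=ax)
plt.show()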
I am trying to compare two csv files to look for common values in column 1.
import csv
f_d1 = open('test1.csv')
f_d2 = open('test2.csv')
f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)
for x, y in zip(f1_csv, f2_csv):
    print(x, y)
I am trying to compare x[0] with y[0]. I am fairly new to Python and trying to find the most pythonic way to achieve the results. Here are the csv files.
test1.csv
Hadrosaurus,1.2
Struthiomimus,0.92
Velociraptor,1.0
Triceratops,0.87
Euoplocephalus,1.6
Stegosaurus,1.4
Tyrannosaurus Rex,2.5
test2.csv
Euoplocephalus,1.87
Stegosaurus,1.9
Tyrannosaurus Rex,5.76
Hadrosaurus,1.4
Deinonychus,1.21
Struthiomimus,1.34
Velociraptor,2.72
I believe you're looking for the set intersection:
import csv
f_d1 = open('test1.csv')
f_d2 = open('test2.csv')
f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)
x = set([item[0] for item in f1_csv])
y = set([item[0] for item in f2_csv])
print(x & y)
Assuming that the files are not prohibitively large, you can read both of them with a CSV reader, convert the first columns to sets, and calculate the set intersection:
import csv

with open('test1.csv') as f:
    set1 = set(x[0] for x in csv.reader(f))
with open('test2.csv') as f:
    set2 = set(x[0] for x in csv.reader(f))

print(set1 & set2)
#{'Hadrosaurus', 'Euoplocephalus', 'Tyrannosaurus Rex', 'Struthiomimus',
# 'Velociraptor', 'Stegosaurus'}
I added a line to test whether the numerical values in each row are the same. You can modify this to test whether, for instance, the values are within some distance of each other:
import csv
f_d1 = open('test1.csv')
f_d2 = open('test2.csv')
f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)
for x, y in zip(f1_csv, f2_csv):
    if x[1] == y[1]:
        print('they match!')
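For instance, a tolerance-based variant might look like this (my sketch; it keeps the answer's positional zip pairing, and the 0.5 threshold is arbitrary):

import csv

with open('test1.csv') as f1, open('test2.csv') as f2:
    for x, y in zip(csv.reader(f1), csv.reader(f2)):
        if abs(float(x[1]) - float(y[1])) < 0.5:  # arbitrary tolerance
            print('close match:', x, y)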
Take advantage of Python's defaultdict: you can iterate over both files and collect the values per name in a dictionary, like this:
from collections import defaultdict

d = defaultdict(list)
for row in f1_csv:
    d[row[0]].append(row[1])
for row in f2_csv:
    d[row[0]].append(row[1])

d = {k: d[k] for k in d if len(d[k]) > 1}
print(d)
Output:
{'Hadrosaurus': ['1.2', '1.4'], 'Struthiomimus': ['0.92', '1.34'], 'Velociraptor': ['1.0', '2.72'],
'Euoplocephalus': ['1.6', '1.87'], 'Stegosaurus': ['1.4', '1.9'], 'Tyrannosaurus Rex': ['2.5', '5.76']}
I have data in columns of a CSV file, and I build an array from two of its columns, using a list of lists of strings like this:
[["A","Bcdef"],["Z","Wexy"]]
I want to identify duplicate entries, i.e. [A,Bcdef] and [A,Bcdef].
import csv
import StringIO
import os, sys
import hashlib
from collections import Counter
from collections import defaultdict
from itertools import takewhile, count

columns = defaultdict(list)
with open('person.csv', 'rU') as f:
    reader = csv.DictReader(f)  # read rows into a dictionary format
    listoflists = []
    for row in reader:  # read a row as {column1: value1, column2: value2, ...}
        a_list = []
        for (c, n) in row.items():
            if c == "firstName":
                try:
                    a_list.append(n[0])
                except IndexError:
                    pass
        for (c, n) in row.items():
            if c == "lastName":
                try:
                    a_list.append(n)
                except IndexError:
                    pass
        # print list(a_list)
        listoflists.append(a_list)
        # i += 1

print len(listoflists)
I have tried a couple of solutions proposed here.
Using set(listoflists) always returns: unhashable type: 'list'.
Functions like the one below return: 'list' object has no attribute 'values'.
For example:
results = list(filter(lambda x: len(x) > 1, dict1.values()))
if len(results) > 0:
    print('Duplicates Found:')
    print('The following files are identical. the content is identical')
    print('___________________')
    for result in results:
        for subresult in result:
            print('\t\t%s' % subresult)
    print('___________________')
else:
    print('No duplicate files found.')
Any suggestions are welcome.
Rather than lists, you can use tuples, which are hashable.
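A minimal sketch of that idea (my addition), using collections.Counter to spot the entries that occur more than once; the sample data is made up:

from collections import Counter

listoflists = [['A', 'Bcdef'], ['Z', 'Wexy'], ['A', 'Bcdef']]

counts = Counter(tuple(x) for x in listoflists)  # tuples are hashable, lists are not
dups = [list(t) for t, c in counts.items() if c > 1]
print(dups)  # [['A', 'Bcdef']]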
You could build a set of the string representations of your lists, which are hashable.
l = [['A', "BCE"], ["B", "CEF"], ['A', 'BCE']]
res = []
dups = []

s = sorted(l, key=lambda x: x[0] + x[1])
previous = None
while s:
    i = s.pop()
    if i == previous:
        dups.append(i)
    else:
        res.append(i)
    previous = i

print res
print dups
Assuming you just want to get rid of duplicates and don't care about the order, you could turn your lists into strings, throw them into a set, and then turn them back into a list of lists.
foostrings = [x[0] + x[1] for x in listoflists]
listoflists = [[x[0], x[1:]] for x in set(foostrings)]
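One caveat (my note, not the answer's): concatenating the two fields can collide, since 'AB' + 'C' and 'A' + 'BC' yield the same string. Converting each inner list to a tuple sidesteps that ambiguity:

listoflists = [list(t) for t in set(map(tuple, listoflists))]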
Another option, if you're going to be dealing with a bunch of tabular data, is to use pandas.
import pandas as pd
df = pd.DataFrame(listoflists)
deduped_df = df.drop_duplicates()
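If you then need the result back as a list of lists rather than a DataFrame (a small usage note I've added):

deduped_list = deduped_df.values.tolist()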