Python: How can I manipulate CSV data in a list?

I have created a list which holds all of the data in the CSV file. How do I separately call upon data in rows and columns? For instance:

   a   b   c
1  a1  b1  c1
2  a2  b2  c2

How can I identify a single cell within the list?

Try the code below:
l = ['a', 'b', 'c', '1', 'a1', 'b1', 'c1', '2', 'a2', 'b2', 'c2']
columns = 3
# Skip the header, then chunk the flat list into rows of (row_id, *cells)
result = list(zip(*[iter(l[columns:])] * (columns + 1)))
# Map each row id to its cells
result2 = {row[0]: row[1:] for row in result}
item_id = '2'
result2[item_id]
Output:
('a2', 'b2', 'c2')
Or you could find the row id and slice out the cells that follow it:
l = ['a', 'b', 'c', '1', 'a1', 'b1', 'c1', '2', 'a2', 'b2', 'c2']
columns = 3
item_id = '2'
index = l.index(item_id)
l[index + 1:index + columns + 1]
Output:
['a2', 'b2', 'c2']
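If the data is still coming from the CSV file itself, the csv module gives you rows directly, which makes single-cell access simpler. A sketch (io.StringIO stands in for the real file, whose name isn't given in the question):

```python
import csv
import io

# io.StringIO stands in for an open CSV file
data = io.StringIO("id,a,b,c\n1,a1,b1,c1\n2,a2,b2,c2\n")
rows = list(csv.reader(data))

header, body = rows[0], rows[1:]
# Key each row by its first column, like result2 above
by_id = {row[0]: row[1:] for row in body}
print(by_id['2'])     # ['a2', 'b2', 'c2']
print(by_id['2'][1])  # a single cell: 'b2'
```

This way there is no flat list to re-chunk: every row is already a list of cells.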

Zipping dictionary to pandas [duplicate]

This question was closed as a duplicate of "Dictionary of lists to dataframe".
I am trying to zip my dictionary into a pandas DataFrame, using the keys that are in the dictionary rather than writing them out manually:
import pandas as pd
d = {'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3']}  # named d to avoid shadowing the built-in dict
columns = list(d.keys())  # ['A', 'B']
manual_results = list(zip(d['A'], d['B']))  # [('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')]
df = pd.DataFrame(manual_results, columns=columns)
I wish to create the results without explicitly writing the name of each key (d['A'], d['B'], etc.). Any ideas?
There is no need to zip it. Pandas can create a dataframe directly from a dict:
import pandas as pd
d = {'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3']}
df = pd.DataFrame.from_dict(d)
print(df)
A B
0 a1 b1
1 a2 b2
2 a3 b3
reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html
Note: you can also orient it the other way, so the dict keys become the row index instead of columns:
import pandas as pd
d = {'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3']}
df = pd.DataFrame.from_dict(d, orient='index')
print(df)
0 1 2
A a1 a2 a3
B b1 b2 b3
There is no need to use zip(), as pd.DataFrame natively expects its data parameter to be a dict that can contain Series, arrays, lists, etc.
You can simply do as follows:
import pandas as pd
d = {'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3']}
df = pd.DataFrame(d)
Which outputs:
A B
0 a1 b1
1 a2 b2
2 a3 b3

Iterate over different length df & export to csv

When Group2 in my df2 has a third item 'B3', the groupby gives me what I want. How can I get the same output when the arrays have different lengths?
I also struggle with getting all the data into the CSV, not just the last iteration. I tried creating the df before the loop and merging inside it, but something doesn't work.
import pandas as pd

df1 = pd.DataFrame({'Title': ['A1', 'A2', 'A3', 'B1', 'B2', 'C13'],
                    'Whole': ['full', 'full', 'full', 'semi', 'semi', 'semi']})
df2 = pd.DataFrame({'Group1': ['A1', 'A2', 'A3'],
                    'Group2': ['B1', 'B2']})

for column in df2.columns:
    d_group = df1[df1.Title.isin(df2[column])]
    df = d_group.groupby('Whole')['Whole'].count()\
        .rename('Column Name from df2')\
        .reindex(['part', 'full', 'semi'], fill_value='-')\
        .reset_index()
    df.T.to_csv('all_groups2.csv', header=False, index=True)
    print(df.T)
Desired output:
Whole | part | full | semi
--------+---------+----------+----------
Group1 | - | 3 | -
Group2 | - | - | 2
A pandas DataFrame requires all columns to have the same length, so the df2 in your code cannot be constructed.
I recommend using Series instead, something like this:
df1 = pd.DataFrame({'Title': ['A1', 'A2', 'A3', 'B1', 'B2', 'C13'],
                    'Whole': ['full', 'full', 'full', 'semi', 'part', 'semi']})
group1 = pd.Series(['A1', 'A2', 'A3'])
group2 = pd.Series(['B1', 'B2'])
Then you can filter df1 with the isin function and group by 'Whole':
dfg1 = df1[df1['Title'].isin(group1)].groupby('Whole').count()
dfg2 = df1[df1['Title'].isin(group2)].groupby('Whole').count()
And finally join them by concat on axis=1:
res = pd.concat([dfg1, dfg2], axis=1)
res.columns = ['Group1','Group2']
finaldf = res.T
The result is the following:
full part semi
Group1 3.0 NaN NaN
Group2 NaN 1.0 1.0
And finally, you can write it to a CSV with the same code that you had:
finaldf.to_csv('result.csv', header=False, index=True)
I recommend not writing row by row to a file unless the file is very large and cannot be held in memory; in that case, consider partitioning the data or using Dask.
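To match the '-' placeholders from the desired output, the concatenated result can be reindexed to the full category list and its NaNs filled before transposing. A sketch building on the Series approach above (the counts differ from the question's table because this reuses the modified sample data with 'part'):

```python
import pandas as pd

df1 = pd.DataFrame({'Title': ['A1', 'A2', 'A3', 'B1', 'B2', 'C13'],
                    'Whole': ['full', 'full', 'full', 'semi', 'part', 'semi']})
group1 = pd.Series(['A1', 'A2', 'A3'])
group2 = pd.Series(['B1', 'B2'])

dfg1 = df1[df1['Title'].isin(group1)].groupby('Whole').count()
dfg2 = df1[df1['Title'].isin(group2)].groupby('Whole').count()

res = pd.concat([dfg1, dfg2], axis=1)
res.columns = ['Group1', 'Group2']
# Force the full category order, then show missing counts as '-'
res = res.reindex(['part', 'full', 'semi']).astype(object).fillna('-')
finaldf = res.T
print(finaldf)
```

Because the frame now mixes numbers and strings, keep it for display/CSV output only; do any further arithmetic before the fillna step.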
I just realized I could load my df2 as a pd.Series and iterate over the index, not the columns, to get where I wanted to be.
import pandas as pd

df1 = pd.DataFrame({'Title': ['A1', 'A2', 'A3', 'C1', 'C2', 'C3'],
                    'ID': ['B1', 'B2', 'B3', 'A1', 'D2', 'D3'],
                    'Whole': ['full', 'full', 'full', 'semi', 'semi', 'semi']})
df2 = pd.Series({'Group1': ['A1', 'A2', 'A3'],
                 'Group2': ['B1', 'B2']})

df = pd.DataFrame()
for index in df2.index:
    d_group = df1[df1.ID.isin(df2[index])]
    # Note: rename(index, inplace=True) returns None, so it cannot be chained
    df3 = d_group.groupby('Whole')['Whole'].count()\
        .rename(index)\
        .reindex(['part', 'full', 'semi'], fill_value='-')
    # DataFrame.append was removed in pandas 2.x; concat appends the Series as a row
    df = pd.concat([df, df3.to_frame().T])
print(df)

Remove element from every list in a column in pandas dataframe based on another column

I'd like to remove values in list from column B based on column A, wondering how.
Given:
df = pd.DataFrame({
    'A': ['a1', 'a2', 'a3', 'a4'],
    'B': [['a1', 'a2'], ['a1', 'a2', 'a3'], ['a1', 'a3'], []]
})
I want:
result = pd.DataFrame({
    'A': ['a1', 'a2', 'a3', 'a4'],
    'B': [['a1', 'a2'], ['a1', 'a2', 'a3'], ['a1', 'a3'], []],
    'Output': [['a2'], ['a1', 'a3'], ['a1'], []]
})
One way of doing that is applying a filtering function to each row via DataFrame.apply:
df['Output'] = df.apply(lambda x: [i for i in x.B if i != x.A], axis=1)
Another solution using iterrows(), which removes the value from B in place rather than building a new column:
for i, value in df.iterrows():
    try:
        value['B'].remove(value['A'])  # removes only the first occurrence
    except ValueError:
        pass  # A's value was not present in B
print(df)
Output:
A B
0 a1 [a2]
1 a2 [a1, a3]
2 a3 [a1]
3 a4 []

Compare nested list values within columns of a dataframe

How can I compare lists within two columns of a dataframe, check whether the elements of one list are in the other, and create another column holding the missing elements?
The dataframe looks something like this:
df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
                   'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
                   'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
                   'D': ['d1', 'd2', 'd3']})
I want to compare if elements of column C are in column B and output the missing values to column E, the desired output is:
df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
                   'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
                   'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
                   'D': ['d1', 'd2', 'd3'],
                   'E': ['b2', ['b1', 'b2'], '']})
Like your previous related question, you can use a list comprehension. As a general rule, you shouldn't force multiple different types of output, e.g. list or str, depending on result. Therefore, I have chosen lists throughout in this solution.
df['E'] = [list(set(x) - set(y)) for x, y in zip(df['B'], df['C'])]
print(df)
A B C D E
0 a1 [b1, b2] [c1, b1] d1 [b2]
1 a2 [b1, b2, b3] [b3] d2 [b1, b2]
2 a3 [b2] [b2, b1] d3 []
def Desintersection(i):
    Output = [b for b in df['B'][i] if b not in df['C'][i]]
    if len(Output) == 0:
        return ''
    elif len(Output) == 1:
        return Output[0]
    else:
        return Output

df['E'] = df.index.map(Desintersection)
df
As in my previous answer:
(df.B.map(set)-df.C.map(set)).map(list)
Out[112]:
0 [b2]
1 [b2, b1]
2 []
dtype: object
I agree with @jpp that you shouldn't mix the types, as applying the same function to the new E column would then fail because it expects each element to be a list.
The version below also works on E, since it converts single str values to [str] before comparing.
import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
                   'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
                   'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
                   'D': ['d1', 'd2', 'd3']})

def difference(df, A, B):
    # Wrap bare strings in a list so set() treats them as a single element
    elements_to_list = lambda x: [n if isinstance(n, list) else [n] for n in x]
    diff = [list(set(a).difference(set(b))) for a, b in zip(elements_to_list(df[A]), elements_to_list(df[B]))]
    diff = [d if d else "" for d in diff]  # replace empty lists with empty strings
    return [d if len(d) != 1 else d[0] for d in diff]  # extract single values from the list

df['E'] = difference(df, "B", "C")
df['F'] = difference(df, "B", "E")
print(list(df['E']))
print(list(df['F']))
['b2', ['b2', 'b1'], '']
['b1', 'b3', 'b2']

Convert dictionary to list with some data omitted

I'm trying to convert a dictionary of the format:
d = {'A1': ['a', 'a', 'A2 (A3-)', 'a'],
'B1': ['b', 'b', 'B2 (B3-)', 'b'],
'C1': ['c', 'c', 'C2 (C3)-', 'c']}
To a list of the form:
e = [['A1', 'A2', 'A3'], ['B1', 'B2', 'B3'], ['C1', 'C2', 'C3']]
I know I should use regex to get the A2 and A3 data, but I'm having trouble putting this all together...
import re

regex = re.compile(r'(\w+) \((\w+)-.*')
# I suppose that you meant (C3-) and not (C3)-
d = {'A1': ['a', 'a', 'A2 (A3-)', 'a'], 'B1': ['b', 'b', 'B2 (B3-)', 'b'], 'C1': ['c', 'c', 'C2 (C3-)', 'c']}
out = []
for key, values_list in d.items():
    v2, v3 = regex.match(values_list[2]).groups()
    out.append([key, v2, v3])
print(out)
# [['A1', 'A2', 'A3'], ['B1', 'B2', 'B3'], ['C1', 'C2', 'C3']]
Note that since Python 3.7 dicts preserve insertion order, so the output follows the order in which the keys were defined.
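If the 'X2 (X3-)' entry is not always at a fixed index, a small variation scans each value list for the first matching entry instead. A sketch (assumes Python 3.8+ for the walrus operator, and exactly one matching entry per list):

```python
import re

regex = re.compile(r'(\w+) \((\w+)-')
d = {'A1': ['a', 'A2 (A3-)', 'a', 'a'],   # match deliberately moved to index 1
     'B1': ['b', 'b', 'B2 (B3-)', 'b']}

out = []
for key, values in d.items():
    # next() picks the first value the regex matches
    m = next(m for v in values if (m := regex.match(v)))
    out.append([key, *m.groups()])
print(out)  # [['A1', 'A2', 'A3'], ['B1', 'B2', 'B3']]
```

Passing a default as next()'s second argument would avoid a StopIteration if some list has no match.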
