Creating a variable number of lists from a pandas dataframe - python

I have a pandas dataframe generated by another piece of code; the dataframe may have a different number of columns each time it is generated. Let's call them col1, col2, ..., coln, where n is not fixed. Note that col1, col2, ... are just placeholders; the actual column names can be arbitrary, like TimeStamp or PrevState.
From this, I want to convert each column into a list whose name matches the column: a list named col1 holding the entries of the first column, and so on up to coln.
How do I do this?
Thanks

Creating variables dynamically is not recommended; it is better to create a dictionary:
d = df.to_dict('list')
Then select each list by key, using the column name:
print(d['col'])
Sample:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
})

d = df.to_dict('list')
print(d)
{'A': ['a', 'b', 'c', 'd', 'e', 'f'], 'B': [4, 5, 4, 5, 5, 4], 'C': [7, 8, 9, 4, 2, 3]}
print(d['A'])
['a', 'b', 'c', 'd', 'e', 'f']
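Since the whole point is that the column names are not known in advance, you can also iterate over the dictionary without referencing any name directly; a minimal sketch, reusing df from above:
d = df.to_dict('list')
# Works for any number of columns, whatever their names.
for name, values in d.items():
    print(name, values)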

import pandas as pd

df = pd.DataFrame()
df["col1"] = [1, 2, 3, 4, 5]
df["colTWO"] = [6, 7, 8, 9, 10]

# Discouraged: exec creates variables dynamically and can hide bugs.
# Note: repr of a plain list is valid Python; repr of a numpy array
# (df[col].values) would produce array([...]), which fails on exec
# unless numpy's array is in scope, so convert with tolist() first.
for col_name in df.columns:
    exec(col_name + " = " + repr(df[col_name].tolist()))
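Note that the dictionary answer above is preferred for a reason: names created with exec cannot be used later without already knowing them, so any downstream code ends up looking them up dynamically anyway, which is exactly what a dict of lists gives you without the namespace tricks.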

Related

Get counts of unique lists in Pandas

I have a pandas DataFrame where one of the columns is full of lists:
import pandas
df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])
And I'd like to make a pivot table that shows each list and a count of its occurrences:
List       Count
[a, b, c]  2
[d, e, f]  1
Because list is an unhashable type, what aggregate functions could do this?
You can zip a list of rows and a list of counts, then make a dataframe from the zip object:
import pandas
df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])

rows = []
counts = []
for index, row in df.iterrows():
    if row[1] not in rows:
        rows.append(row[1])
        counts.append(1)
    else:
        counts[rows.index(row[1])] += 1

df = pandas.DataFrame(zip(rows, counts))
print(df)
The solution I ended up using was:
import pandas
df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])
print(df[1])

# Lists are unhashable, so convert them to tuples first (thanks Ch3steR).
df[1] = df[1].map(tuple)
df2 = pandas.pivot_table(df, index=df[1], aggfunc='count')
print(df2)
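For reference, once the lists are made hashable with map(tuple), value_counts gives the same counts in one step; a minimal sketch:
import pandas
df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])

# Tuples are hashable, so value_counts can count them directly.
counts = df[1].map(tuple).value_counts()
print(counts)
# ('a', 'b', 'c')    2
# ('d', 'e', 'f')    1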

Populate values in a dataframe based on matching row and column of another dataframe

I have two dataframes, and I want to populate new column values in dataframe 1 based on matching Zipcode and date from dataframe 2.
The sample input and desired output are given below. The date formats are not the same. Dataframe 1 has more than 100k records, and dataframe 2 has a column for every month.
Any suggestions would be of great help, since I am a newbie to Python.
You are looking for pd.merge. Here is an example showing how to use it:
df1 = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6],
                    'y': ['a', 'b', 'c', 'd', 'e', 'f']})
df2 = pd.DataFrame({'x2': [1, 2, 3, 4, 5, 6],
                    'y': ['h', 'i', 'j', 'k', 'l', 'm']})
pd.merge(df1, df2, left_on='x1', right_on='x2')
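Applied to the question, a minimal sketch; the column names Zipcode, Date, and Value are assumptions, since the sample frames are not shown here. Normalize the differing date formats first, then merge on both keys:
import pandas as pd

# Hypothetical stand-ins for the question's two dataframes.
df1 = pd.DataFrame({'Zipcode': [10001, 10002],
                    'Date': ['2020-01-15', '2020-02-10']})
df2 = pd.DataFrame({'Zipcode': [10001, 10002],
                    'Date': ['01/15/2020', '02/10/2020'],
                    'Value': [3.2, 4.7]})

# Bring both date columns to a common dtype before matching.
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

# A left merge keeps every row of df1 and pulls in matching Values.
out = df1.merge(df2, on=['Zipcode', 'Date'], how='left')
print(out)
If dataframe 2 really has one column per month, reshape it to long form with df2.melt(...) before merging.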

How to groupby the keys in dictionary and sum up the values in python?

How do I group by two keys in a dictionary and get the sum of the values under the other key, val?
Input:
data = {'key1': ['a', 'a', 'b', 'b'],
        'key2': ['m', 'n', 'm', 'm'],
        'val': [1, 2, 3, 4]}
In this example, I want to group by key1 and key2, and then sum up the values in val.
Expected:
data = {'key1': ['a', 'a', 'b', 'b'],
        'key2': ['m', 'n', 'm', 'm'],
        'val': [1, 2, 3, 4],
        'val_sum': [1, 2, 7, 7]}
I don't want to convert the dictionary into a pandas.DataFrame and then convert back to a dictionary to achieve this, because my data is very big.
Update:
To help understand the generating val_sum, I post my code using pandas.DataFrame.
df = pd.DataFrame(data)
# Per-group sums, keyed by (key1, key2). (The dict-renaming form
# .agg({'val_sum': 'sum'}) was removed in pandas 1.0.)
tmp = df.groupby(['key1', 'key2'])['val'].sum()
df['val_sum'] = df.set_index(['key1', 'key2']).index.map(tmp.to_dict())
And the result is shown as follows:
  key1 key2  val  val_sum
0    a    m    1        1
1    a    n    2        2
2    b    m    3        7
3    b    m    4        7
You can build your own summing solution using a defaultdict, say as follows.
from collections import defaultdict

data = {'key1': ['a', 'a', 'b', 'b'],
        'key2': ['m', 'n', 'm', 'm'],
        'val': [1, 2, 3, 4]}

keys_to_group = ['key1', 'key2']

temp = defaultdict(int)  # missing keys start their sum at zero
for i, *key_group in zip(data['val'], *[data[key] for key in keys_to_group]):
    # key_group now looks like ['a', 'm'] or ['b', 'm'] and so on
    temp[tuple(key_group)] += i

val_sum = [temp[key_group] for key_group in zip(*[data[key] for key in keys_to_group])]
data['val_sum'] = val_sum
print(data)
{'key1': ['a', 'a', 'b', 'b'],
'key2': ['m', 'n', 'm', 'm'],
'val': [1, 2, 3, 4],
'val_sum': [1, 2, 7, 7]}
Having said that, your data does seem suited to a tabular structure, and if you plan to do more than this one operation, it may make sense to load it into a dataframe anyway.
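For reference, in pandas the whole operation is a single transform; a minimal sketch, reusing the data dict from above:
import pandas as pd

data = {'key1': ['a', 'a', 'b', 'b'],
        'key2': ['m', 'n', 'm', 'm'],
        'val': [1, 2, 3, 4]}

df = pd.DataFrame(data)
# transform('sum') broadcasts each group's total back to its own rows.
df['val_sum'] = df.groupby(['key1', 'key2'])['val'].transform('sum')
print(df['val_sum'].tolist())  # [1, 2, 7, 7]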

Lowercase columns by name using dataframe method

I have a dataframe containing strings and NaNs. I want to str.lower() certain columns by name, to_lower = ['b', 'd', 'e']. Ideally I could do it with a method on the whole dataframe, rather than with a method on df[to_lower]. I have
df[to_lower] = df[to_lower].apply(lambda x: x.astype(str).str.lower())
but I would like a way to do it without assigning to the selected columns.
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b']})
to_lower = ['a']
df2 = df.copy()
df2[to_lower] = df2[to_lower].apply(lambda x: x.astype(str).str.lower())
You can use the assign method and unpack the result as keyword arguments:
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b'], 'c': ['C', 'c']})
to_lower = ['a', 'b']
df.assign(**df[to_lower].apply(lambda x: x.astype(str).str.lower()))
#    a  b  c
# 0  a  b  C
# 1  a  b  c
You want this:
for column in to_lower:
    df[column] = df[column].str.lower()
This is far more efficient, assuming you have more rows than columns.

How can I change the original DataFrame from a group?

Let's suppose I have the following DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'label': ['a', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'a', 'a'],
                   'numbers': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'arbitrarydata': [False] * 10})
I want to assign a value to the arbitrarydata column according to the values in both of the other columns. A naive approach would be as follows:
for _, grp in df.groupby(['label', 'numbers']):
    grp.arbitrarydata = np.random.rand()
Naturally, this doesn't propagate changes back to df. Is there a way to modify a group such that the changes are reflected in the original DataFrame?
Try using transform, e.g.:
df['arbitrarydata'] = df.groupby(['label', 'numbers'])['arbitrarydata'].transform(lambda x: np.random.rand())
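Here transform evaluates the lambda once per group and broadcasts the resulting scalar back to every row of that group, aligned to the original index, so all rows sharing a ('label', 'numbers') pair receive the same random value. A quick self-contained check:
import numpy as np
import pandas as pd

df = pd.DataFrame({'label': ['a', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'a', 'a'],
                   'numbers': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'arbitrarydata': [False] * 10})
df['arbitrarydata'] = (df.groupby(['label', 'numbers'])['arbitrarydata']
                         .transform(lambda x: np.random.rand()))

# Every group now holds exactly one distinct value.
print(df.groupby(['label', 'numbers'])['arbitrarydata'].nunique())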
