This question already has answers here: Convert Python dict into a dataframe (18 answers). Closed 2 years ago.
{'student1': 45,
'student2': 78,
'student3': 12,
'student4': 14,
'student5': 48,
'student6': 43,
'student7': 47,
'student8': 98,
'student9': 35,
'student10': 80}
How do I convert this dict into a DataFrame?
import pandas as pd
student = {
    "student1": 45,
    "student2": 78,
    "student3": 12,
    "student4": 14,
    "student5": 48,
    "student6": 43,
    "student7": 47,
    "student8": 98,
    "student9": 35,
    "student10": 80,
}
df = pd.DataFrame(student.items(), columns=["name", "score"])
print(df)
name score
0 student1 45
1 student2 78
2 student3 12
3 student4 14
4 student5 48
5 student6 43
6 student7 47
7 student8 98
8 student9 35
9 student10 80
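If you prefer the student names as the index rather than as a column, DataFrame.from_dict with orient='index' is another option; a minimal sketch (the column name 'score' is just an example):
df_idx = pd.DataFrame.from_dict(student, orient='index', columns=['score'])
print(df_idx)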
import pandas as pd
# initialise data of lists, where each key will become a column
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
# Create DataFrame
df = pd.DataFrame(data)
# or a list of dicts, where each dict becomes a row
data = [{'a': 1, 'b': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30}]
df = pd.DataFrame(data)
If you are getting the "If using all scalar values, you must pass an index" error, do this:
import pandas as pd
data = {'student1': 45, 'student2': 78, 'student3': 12, 'student4': 14, 'student5': 48, 'student6': 43, 'student7': 47, 'student8': 98, 'student9': 35, 'student10': 80}
for i in data.keys():
    data[i] = [data[i]]
df = pd.DataFrame(data)
df.head()
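As an alternative to wrapping every value in a list, you can also pass an explicit index; a minimal sketch, assuming data still holds the original plain scalars:
df = pd.DataFrame(data, index=[0])
df.head()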
This should do the trick:
df = pd.DataFrame(list(my_dict.items()), columns=['column1', 'column2'])
pd.DataFrame(dict_.items())
pd.DataFrame(dict_.items(), columns=['Student', 'Point'])
pd.Series(dict_, name='StudentValue')
All will work.
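For example, the Series variant can still be turned into a two-column DataFrame if needed; a quick sketch, where dict_ is the student dict from the question:
df = pd.Series(dict_, name='StudentValue').rename_axis('Student').reset_index()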
This is similar to previous questions about how to expand a list-based column across several columns, but the solutions I'm seeing don't seem to work for Dask. Note that the actual DataFrames I'm working with are too large to hold in memory, so converting to pandas first is not an option.
I have a df with a column that contains lists:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'a': [np.random.randint(100, size=4) for _ in range(20)]})
dask_df = dd.from_pandas(df, chunksize=10)
dask_df['a'].compute()
0 [52, 38, 59, 78]
1 [79, 71, 13, 63]
2 [15, 81, 79, 76]
3 [53, 4, 94, 62]
4 [91, 34, 26, 92]
5 [96, 1, 69, 27]
6 [84, 91, 96, 68]
7 [93, 56, 45, 40]
8 [54, 1, 96, 76]
9 [27, 11, 79, 7]
10 [27, 60, 78, 23]
11 [56, 61, 88, 68]
12 [81, 10, 79, 65]
13 [34, 49, 30, 3]
14 [32, 46, 53, 62]
15 [20, 46, 87, 31]
16 [89, 9, 11, 4]
17 [26, 46, 19, 27]
18 [79, 44, 45, 56]
19 [22, 18, 31, 90]
Name: a, dtype: object
According to this solution, if this were a pd.DataFrame I could do something like this:
new_dask_df = dask_df['a'].apply(pd.Series)
but with Dask this fails with:
ValueError: The columns in the computed data do not match the columns in the provided metadata
Extra: [1, 2, 3]
Missing: []
There's another solution listed here:
import dask.array as da
import dask.dataframe as dd
x = da.ones((4, 2), chunks=(2, 2))
df = dd.io.from_dask_array(x, columns=['a', 'b'])
df.compute()
So for dask I tried:
df = dd.io.from_dask_array(dask_df.values)
but that just gives me back the same DataFrame I had before (screenshot: https://i.stack.imgur.com/T099A.png).
I'm not really sure why, since the types of the example 'x' and of the values in my df are the same:
print(type(dask_df.values), type(x))
<class 'dask.array.core.Array'> <class 'dask.array.core.Array'>
print(type(dask_df.values.compute()[0]), type(x.compute()[0]))
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
Edit: I kind of have a working solution, but it involves iterating through each groupby object. It feels like there should be a better way:
dask_groups = dask_df.explode('a').reset_index().groupby('index')
final_df = []
for idx in dask_df.index.values.compute():
    group = dask_groups.get_group(idx).drop(columns='index').compute()
    group_size = list(range(len(group)))
    row = group.transpose()
    row.columns = group_size
    row['index'] = idx
    final_df.append(dd.from_pandas(row, chunksize=10))
final_df = dd.concat(final_df).set_index('index')
In this case dask doesn't know what to expect from the outcome, so it's best to specify meta explicitly:
# this is a short-cut to use the existing pandas df
# in actual code it is sufficient to provide an
# empty series with the expected dtype
meta = df['a'].apply(pd.Series)
new_dask_df = dask_df['a'].apply(pd.Series, meta=meta)
new_dask_df.compute()
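If you'd rather not build meta from an in-memory pandas frame, Dask also accepts a mapping of expected column names to dtypes; a sketch assuming every list has exactly four integer elements (so apply(pd.Series) yields columns 0 through 3):
meta = {0: 'int64', 1: 'int64', 2: 'int64', 3: 'int64'}
new_dask_df = dask_df['a'].apply(pd.Series, meta=meta)
new_dask_df.compute()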
I got a working solution. My original function created a list which resulted in the column of lists, as above. Changing the applied function to return a dask bag seems to do the trick:
import random
import dask.bag as db

def create_df_row(x):
    vals = np.random.randint(2, size=4)
    return db.from_sequence([vals], partition_size=2).to_dataframe()
test_df = dd.from_pandas(pd.DataFrame({'a':[random.choice(['a', 'b', 'c']) for _ in range(20)]}), chunksize=10)
test_df.head()
mini_dfs = [*test_df.groupby('a')['a'].apply(lambda x: create_df_row(x))]
result = dd.concat(mini_dfs)
result.compute().head()
But I'm not sure this solves the in-memory issue, as I'm now holding a list of groupby results.
Here's how to expand a list-like column across multiple columns manually:
dask_df["a0"] = dask_df["a"].str[0]
dask_df["a1"] = dask_df["a"].str[1]
dask_df["a2"] = dask_df["a"].str[2]
dask_df["a3"] = dask_df["a"].str[3]
print(dask_df.head())
a a0 a1 a2 a3
0 [71, 16, 0, 10] 71 16 0 10
1 [59, 65, 99, 74] 59 65 99 74
2 [83, 26, 33, 38] 83 26 33 38
3 [70, 5, 19, 37] 70 5 19 37
4 [0, 59, 4, 80] 0 59 4 80
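If the list length is known up front, the same expansion can be written as a short loop rather than one line per column; a minimal sketch assuming fixed-length lists of four elements:
for i in range(4):
    dask_df[f"a{i}"] = dask_df["a"].str[i]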
SultanOrazbayev's answer seems more elegant.
Input DataFrame is as below:
import pandas as pd

data = {
    's_id': [5, 7, 26, 70.0, 55, 71.0, 8.0, 'nan', 'nan', 4],
    'r_id': [[34, 44, 23, 11, 71], [53, 33, 73, 41], [17], [10, 31], [17], [75, 8], [7], [68], [50], []]
}
df = pd.DataFrame.from_dict(data)
df
Out[240]:
s_id r_id
0 5 [34, 44, 23, 11, 71]
1 7 [53, 33, 73, 41]
2 26 [17]
3 70 [10, 31]
4 55 [17]
5 71 [75, 8]
6 8 [7]
7 nan [68]
8 nan [50]
9 4 []
Expected DataFrame:
data = {
    's_id': [5, 7, 26, 70.0, 55, 71.0, 8.0, 'nan', 'nan', 4],
    'r_id': [[5, 34, 44, 23, 11, 71], [7, 53, 33, 73, 41], [26, 17], [70, 10, 31], [55, 17], [71, 75, 8], [8, 7], [68], [50], [4]]
}
df = pd.DataFrame.from_dict(data)
df
Out[241]:
s_id r_id
0 5 [5, 34, 44, 23, 11, 71]
1 7 [7, 53, 33, 73, 41]
2 26 [26, 17]
3 70 [70, 10, 31]
4 55 [55, 17]
5 71 [71, 75, 8]
6 8 [8, 7]
7 nan [68]
8 nan [50]
9 4 [4]
I need to prepend the value from s_id as the first element of the list in r_id. I also have NaN values, and some of the s_id values appear as floats. Thank you.
I tried the following:
df['r_id'] = df["s_id"].apply(lambda x : x.append(df['r_id']) )
df['r_id'] = df["s_id"].apply(lambda x : [x].append(df['r_id'].values.tolist()))
If the NaNs are missing values, use apply: convert each s_id value to a one-element list of integers and prepend it, filtering to omit the missing values:
import numpy as np
import pandas as pd

data = {
    's_id': [5, 7, 26, 70.0, 55, 71.0, 8.0, np.nan, np.nan, 4],
    'r_id': [[34, 44, 23, 11, 71], [53, 33, 73, 41],
             [17], [10, 31], [17], [75, 8], [7], [68], [50], []]
}
df = pd.DataFrame.from_dict(data)
print (df)
f = lambda x : [int(x["s_id"])] + x['r_id'] if pd.notna(x["s_id"]) else x['r_id']
df['r_id'] = df.apply(f, axis=1)
print (df)
s_id r_id
0 5.0 [5, 34, 44, 23, 11, 71]
1 7.0 [7, 53, 33, 73, 41]
2 26.0 [26, 17]
3 70.0 [70, 10, 31]
4 55.0 [55, 17]
5 71.0 [71, 75, 8]
6 8.0 [8, 7]
7 NaN [68]
8 NaN [50]
9 4.0 [4]
Another idea is to filter the column and apply the function only to non-NaN rows:
m = df["s_id"].notna()
f = lambda x : [int(x["s_id"])] + x['r_id']
df.loc[m, 'r_id'] = df[m].apply(f, axis=1)
print (df)
s_id r_id
0 5.0 [5, 34, 44, 23, 11, 71]
1 7.0 [7, 53, 33, 73, 41]
2 26.0 [26, 17]
3 70.0 [70, 10, 31]
4 55.0 [55, 17]
5 71.0 [71, 75, 8]
6 8.0 [8, 7]
7 NaN [68]
8 NaN [50]
9 4.0 [4]
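A plain list comprehension over both columns is another way to express the same logic without apply; just a sketch of the idea, starting from the original df:
df['r_id'] = [([int(s)] if pd.notna(s) else []) + r
              for s, r in zip(df['s_id'], df['r_id'])]
print(df)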
I want to convert all columns (59 columns) from my Excel file to a DataFrame, specifying the types.
Some columns are strings, others dates, others int, and more.
I know I can use converters in the read_excel method, but I have a lot of columns and I don't want to write converters={'column1': type1, 'column2': type2, ..., 'column59': type59}.
my code is:
import numpy as np
import pandas as pd
import recordlinkage
import xlrd
fileName = 'C:/Users/Tito/Desktop/banco ZIKA4.xlsx'
strcols = [0, 5, 31, 36, 37, 38, 39, 40, 41, 45]
datecols = [3, 4, 29, 30, 32, 48, 50, 51, 52, 53, 54, 55]
intcols = [33, 43, 59]
booleancols = [6, ..., 28]
df = pd.read_excel(fileName, sheet_name=0, true_values=['s'], false_values=['n'], converters={strcols: str, intcols: np.int, booleancols: np.bool, datecols: pd.to_datetime})
print(df.iat[1, 31], df.iat[1, 32], df.iat[1, 33])
IIUC, your code doesn't work because the converters kwarg doesn't accept a list of several columns as a key; each key must be a single column label or index.
What you can do is to create dicts instead of lists and provide the concatenated dicts to converters:
strcols = {c: str for c in [0, 5, 31, 36, 37, 38, 39, 40, 41, 45]}
datecols = {c: pd.to_datetime for c in [3, 4, 29, 30, 32, 48, 50, 51, 52, 53, 54, 55]}
intcols = {c: int for c in [33, 43, 59]}
booleancols = {c: bool for c in range(6, 29)}
conv_fcts = {**strcols, **datecols, **intcols, **booleancols}
df = pd.read_excel(fileName, converters=conv_fcts, sheet_name=0, true_values=['s'], false_values=['n'])
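As a side note, read_excel also has a parse_dates argument, so the date columns could be handled there instead of through converters if you prefer; a hedged sketch reusing the dicts above:
df = pd.read_excel(fileName, sheet_name=0, true_values=['s'], false_values=['n'],
                   converters={**strcols, **intcols, **booleancols},
                   parse_dates=[3, 4, 29, 30, 32, 48, 50, 51, 52, 53, 54, 55])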
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=50,
                           n_features=6,
                           n_informative=3,
                           n_classes=2,
                           random_state=10,
                           shuffle=True)
# Creating a dataFrame
df = pd.DataFrame({'Feature 1': X[:, 0],
                   'Feature 2': X[:, 1],
                   'Feature 3': X[:, 2],
                   'Feature 4': X[:, 3],
                   'Feature 5': X[:, 4],
                   'Feature 6': X[:, 5],
                   'Class': y})
values = [i for i,x in enumerate(df['Class']) if x == 0]
print(values)
The output is
[5, 6, 9, 11, 13, 14, 17, 18, 20, 21, 23, 24, 25, 26, 27, 31, 32, 34, 41, 42, 44, 45, 46, 47, 49]
I am trying to group the above output based on whether the numbers are consecutive, so that the output looks like:
Group 1: 5,6
Group 2: 9
Group 3: 11
Group 4: 13,14
..
..
Group n: 23,24,25,26,27
I am grouping them to have an understanding of the gaps in the column, instead of having a slab of values following each other in a list.
I think you need a Series: get the differences with diff, compare with gt, and finally create the groups with cumsum in a new Series, which is used as the by argument of groupby:
values = [5, 6, 9, 11, 13, 14, 17, 18, 20, 21, 23,
          24, 25, 26, 27, 31, 32, 34, 41, 42, 44, 45, 46, 47, 49]
s = pd.Series(values)
s1 = s.groupby(s.diff().gt(1).cumsum() + 1).apply(lambda x: ','.join(x.astype(str)))
print (s1)
1 5,6
2 9
3 11
4 13,14
5 17,18
6 20,21
7 23,24,25,26,27
8 31,32
9 34
10 41,42
11 44,45,46,47
12 49
dtype: object
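If you don't need pandas for this step, the classic itertools recipe groups consecutive runs by the difference between each value and its position; a minimal pure-Python sketch:
from itertools import groupby

groups = [[v for _, v in grp]
          for _, grp in groupby(enumerate(values), key=lambda t: t[1] - t[0])]
print(groups)  # [[5, 6], [9], [11], [13, 14], ...]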
When I run the following code:
import pandas as pd
web_states = {'Day': [1, 2, 3, 4, 5, 6],
              'Visitors': [43, 53, 46, 78, 88, 24],
              'BounceRates': [65, 74, 99, 98, 45, 56]}
df = pd.DataFrame(web_states)
print(df)
I get the following error:
File "C:\Users\Python36-32\lib\site-packages\numpy\core_init_.py",
line 16, in from . import multiarray SystemError:
initialization of multiarray raised unreported exception
Please advise.
BounceRates is too short.
Your code:
web_states = {'Day': [1, 2, 3, 4, 5, 6],
              'Visitors': [43, 53, 46, 78, 88, 24],
              'BounceRates': [65, 74, 99, 98, 45]}
df = pd.DataFrame(web_states)
Produces:
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 5446, in extract_index
raise ValueError('arrays must all be same length')
ValueError: arrays must all be same length
Lengthen BounceRates:
web_states = {'Day': [1, 2, 3, 4, 5, 6],
              'Visitors': [43, 53, 46, 78, 88, 24],
              'BounceRates': [65, 74, 99, 98, 45, 0]}
df = pd.DataFrame(web_states)
print(df)
Produces:
BounceRates Day Visitors
0 65 1 43
1 74 2 53
2 99 3 46
3 98 4 78
4 45 5 88
5 0 6 24
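If you ever do need to build a DataFrame from lists of unequal length and want the shorter ones padded with NaN, wrapping each list in a Series works, since DataFrame aligns Series on the index; a small sketch:
df = pd.DataFrame({k: pd.Series(v) for k, v in web_states.items()})
print(df)  # any shorter column would get NaN in its missing rows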