I have a list that is created from two columns of a dataframe. I need to create a dictionary where the keys are the elements of the list and the values are the elements of a column in the dataframe. Below is an example I just created. The dataframe I am using is large, and so is the list.
import pandas as pd

data = {'init': [1, 2, 1], 'term': [2, 3, 3], 'cost': [10, 20, 30]}
df = pd.DataFrame.from_dict(data)
link = [(1, 2), (1, 3), (2, 3)]
I need to create the following dictionary using the dataframe and list.
link_cost = {(1, 2): 10, (1, 3): 30, (2, 3): 20}
Could anyone help me with this? Any comments or instruction would be appreciated.
Let's try set_index + reindex then Series.to_dict:
d = df.set_index(['init', 'term'])['cost'].reindex(index=link).to_dict()
d:
{(1, 2): 10, (1, 3): 30, (2, 3): 20}
set_index with multiple columns will create a MultiIndex which can be indexed with tuples. Selecting a specific column and then reindexing will allow the list link to reorder/select specific values from the Series. Series.to_dict will create the dictionary output.
Setup used:
import pandas as pd
df = pd.DataFrame({
    'init': [1, 2, 1],
    'term': [2, 3, 3],
    'cost': [10, 20, 30]
})
link = [(1, 2), (1, 3), (2, 3)]
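One caveat worth adding (my note, not part of the original answer): if link contains a pair that does not occur in the frame, reindex fills it with NaN instead of raising, so missing keys can slip through silently. For example, with a hypothetical extra pair (3, 1):
d = df.set_index(['init', 'term'])['cost'].reindex(index=link + [(3, 1)]).to_dict()
# {(1, 2): 10, (1, 3): 30, (2, 3): 20, (3, 1): nan}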
Why are you even using pandas for this? If link lines up with the row order of the dataframe, you have the dict right there:
link_cost = dict(zip(link, data['cost']))
# or if you must use the dataframe it's the same
link_cost = dict(zip(link, df['cost']))
{(1, 2): 10, (2, 3): 20, (1, 3): 30}
Note that zip pairs purely by position, so this is only correct when link is ordered like the rows; with the question's link, which lists (1, 3) before (2, 3), use one of the index-based approaches instead.
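If you cannot guarantee that ordering, a small sketch of the same zip idea that builds the keys from the frame itself stays consistent by construction:
link_cost = dict(zip(zip(df['init'], df['term']), df['cost']))
# {(1, 2): 10, (2, 3): 20, (1, 3): 30}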
One approach is to use DataFrame.set_index, Index.isin and DataFrame.itertuples:
import pandas as pd
data = {'init': [1, 2, 1], 'term': [2, 3, 3], 'cost': [10, 20, 30]}
df = pd.DataFrame.from_dict(data)
link = [(1, 2), (2, 3), (1, 3)]
cols = ["init", "term"]
new = df.set_index(cols)
res = dict(new[new.index.isin(link)].itertuples(name=None))
print(res)
Output
{(1, 2): 10, (2, 3): 20, (1, 3): 30}
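A merge-based variant (my own sketch, not from the answer above) performs the same lookup while preserving the order of link:
keys = pd.DataFrame(link, columns=['init', 'term'])
merged = keys.merge(df, on=['init', 'term'], how='left')  # left join keeps link's row order
res = dict(zip(link, merged['cost']))
print(res)  # {(1, 2): 10, (2, 3): 20, (1, 3): 30}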
This is my task:
Write a function that accepts a dataframe as input, the name of the column with missing values, and a list of grouping columns, and returns the dataframe with the missing values filled in with the median value.
Here is what I tried to do:
def fillnull(set, col):
    val = {col: set[col].sum() / set[col].count()}
    set.fillna(val)
    return set

fillnull(titset, 'Age')
My problem is that my function doesn't work; also, I don't know how to compute the median or how to do the grouping inside this function.
Here are screenshots of my dataframe and of the missing values in my dataset (attached as images: DATAFRAME, NaN Values).
Check whether this code works for you:
import pandas as pd
df = pd.DataFrame({
    'processId': range(100, 900, 100),
    'groupId': [1, 1, 2, 2, 3, 3, 4, 4],
    'other': [1, 2, 3, None, 3, 4, None, 9]
})
print(df)
def fill_na(df, missing_value_col, grouping_col):
    # Median of the target column within each group
    values = df.groupby(grouping_col)[missing_value_col].median()
    # Index by the grouping column so fillna can align on it
    df.set_index(grouping_col, inplace=True)
    df[missing_value_col].fillna(values, inplace=True)
    df.reset_index(grouping_col, inplace=True)
    return df
fill_na(df, 'other', 'groupId')
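As an alternative sketch (assuming the same df and column names as above), groupby.transform avoids mutating the index entirely: transform('median') broadcasts each group's median back to the original row positions, so fillna can align directly.
def fill_na_transform(df, missing_value_col, grouping_col):
    # Each row receives its own group's median, aligned on the original index
    medians = df.groupby(grouping_col)[missing_value_col].transform('median')
    df[missing_value_col] = df[missing_value_col].fillna(medians)
    return df

fill_na_transform(df, 'other', 'groupId')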
What I have is a basic dataframe that I want to pull two values out of, based on index position. So for this:
first_column  second_column
           1              1
           2              2
           3              3
           4              4
           5              5
I want to extract the values in row 1 and row 2 (1 2) out of first_column, then extract the values in row 2 and row 3 (2 3) out of first_column, and so on until I've iterated over the entire column. I ran into an issue with the for loop and am stuck on getting the next index value.
I have code like below:
import pandas as pd
data = {'first_column': [1, 2, 3, 4, 5],
        'second_column': [1, 2, 3, 4, 5],
        }
df = pd.DataFrame(data)
for index, row in df.iterrows():
    print(index, row['first_column'])  # value1
    print(index + 1, row['first_column'].values(index + 1))  # value2 <-- error in logic here
Ignoring the prints, which will eventually become variables that are returned, how can I improve this to return (1 2), (2 3), (3 4), (4 5), etc.?
Also, is this easier done with the iteritems() method instead of iterrows?
Not sure if this is what you want to achieve:
temp = (df.assign(second_column=df.second_column.shift(-1))
          .dropna()
          .assign(second_column=lambda df: df.second_column.astype(int))
       )
[*zip(temp.first_column.array, temp.second_column.array)]
[(1, 2), (2, 3), (3, 4), (4, 5)]
A simpler solution from #HenryEcker:
list(zip(df['first_column'], df['first_column'].iloc[1:]))
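On Python 3.10+, the same pairing is available directly as itertools.pairwise:
from itertools import pairwise
list(pairwise(df['first_column']))
# [(1, 2), (2, 3), (3, 4), (4, 5)]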
I don't know if this answers your question, but maybe you can try this:
for i, val in enumerate(df['first_column']):
    if val + 1 > 5:  # 5 is hard-coded as the column's last value for this example
        break
    else:
        print(val, ", ", val + 1)
If you want to take these items in the same fashion, you should consider using iloc instead of iterrows.
out = []
for i in range(len(df) - 1):
    print(i, df.iloc[i]["first_column"])
    print(i + 1, df.iloc[i + 1]["first_column"])
    out.append((df.iloc[i]["first_column"],
                df.iloc[i + 1]["first_column"]))
print(out)
[(1, 2), (2, 3), (3, 4), (4, 5)]
I have a Dataframe which I want to transform into a multidimensional array using one of the columns as the 3rd dimension.
As an example:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 2, 3, 3, 3],
    'date': np.random.randint(1, 6, 6),
    'value1': [11, 12, 13, 14, 15, 16],
    'value2': [21, 22, 23, 24, 25, 26]
})
I would like to transform it into a 3D array with dimensions (id, date, values). The problem is that the 'id's do not have the same number of occurrences, so I cannot use np.reshape().
For this simplified example, I was able to use:
ra = np.full((3, 3, 3), np.nan)
for i, value in enumerate(df['id'].unique()):
    rows = df.loc[df['id'] == value].shape[0]
    ra[i, :rows, :] = df.loc[df['id'] == value, 'date':'value2']
This produces the needed result, but the original DataFrame contains millions of rows.
Is there a vectorized way to accomplish the same result?
Approach #1
Here's one vectorized approach after sorting the id column with df.sort_values('id', inplace=True), as suggested by @Yannis in the comments -
count_id = df.id.value_counts().sort_index().values
mask = count_id[:,None] > np.arange(count_id.max())
vals = df.loc[:, 'date':'value2'].values
out_shp = mask.shape + (vals.shape[1],)
out = np.full(out_shp, np.nan)
out[mask] = vals
Approach #2
Another with factorize that doesn't require any pre-sorting -
x = df.id.factorize()[0]
y = df.groupby(x).cumcount().values
vals = df.loc[:, 'date':'value2'].values
out_shp = (x.max()+1, y.max()+1, vals.shape[1])
out = np.full(out_shp, np.nan)
out[x,y] = vals
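As a quick sanity check (a sketch only; the dates are fixed here rather than random so the run is reproducible), Approach #2 on the example frame produces the expected (3, 3, 3) shape:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 2, 3, 3, 3],
    'date': [1, 2, 3, 4, 5, 1],  # fixed values instead of np.random.randint
    'value1': [11, 12, 13, 14, 15, 16],
    'value2': [21, 22, 23, 24, 25, 26]
})
x = df.id.factorize()[0]
y = df.groupby(x).cumcount().values
vals = df.loc[:, 'date':'value2'].values
out = np.full((x.max() + 1, y.max() + 1, vals.shape[1]), np.nan)
out[x, y] = vals
print(out.shape)  # (3, 3, 3)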
I would like to drop all data in a pandas dataframe, but am getting TypeError: drop() takes at least 2 arguments (3 given). I essentially want a blank dataframe with just my columns headers.
import pandas as pd
web_stats = {'Day': [1, 2, 3, 4, 2, 6],
             'Visitors': [43, 43, 34, 23, 43, 23],
             'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df.drop(axis=0, inplace=True)
print(df)
You need to pass the labels to be dropped.
df.drop(df.index, inplace=True)
By default, it operates on axis=0.
You can achieve the same with
df.iloc[0:0]
which is much more efficient.
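Either way the column structure survives; a quick check on the frame from the question:
empty = df.iloc[0:0]
print(empty.columns.tolist())  # ['Day', 'Visitors', 'Bounce_Rate']
print(len(empty))              # 0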
My favorite:
df = df.iloc[0:0]
But be aware that df.index.max() will then be nan.
To add items I use:
import math
df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = data
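If the index is the default RangeIndex, a simpler guard (my own sketch) is to append at len(df), which is 0 on a freshly emptied frame:
df.loc[len(df)] = data  # assumes a default 0..n-1 integer index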
My favorite way is:
df = df[0:0]
Overwrite the dataframe with something like this:
import pandas as pd
df = pd.DataFrame(None)
or if you want to keep columns in place
df = pd.DataFrame(columns=df.columns)
If your goal is to drop all the data, then you need to pass all the columns. For me, the best way is to pass a list comprehension to the columns kwarg; this then works regardless of which columns are in the df.
import pandas as pd
web_stats = {'Day': [1, 2, 3, 4, 2, 6],
             'Visitors': [43, 43, 34, 23, 43, 23],
             'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df.drop(columns=[i for i in df.columns])
This code makes a clean dataframe:
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# clean
df = pd.DataFrame()
I want to iterate over a numpy array starting at the index of the highest value and working through to the lowest value.
import numpy as np  # imports numpy package
elevation_array = np.random.rand(5, 5)  # creates a random 5-by-5 array
print(elevation_array)  # prints the array out
ravel_array = np.ravel(elevation_array)  # flatten to 1-D
sorted_array_x = np.argsort(ravel_array)  # indices that would sort the flat array
sorted_array_y = np.argsort(sorted_array_x)  # rank of each element
sorted_array = sorted_array_y.reshape(elevation_array.shape)
for index, rank in np.ndenumerate(sorted_array):
    print(index, rank)
I want it to print out:
index of the highest value
index of the next highest value
index of the next highest value, etc.
If you want numpy doing the heavy lifting, you can do something like this:
>>> a = np.random.rand(100, 100)
>>> sort_idx = np.argsort(a, axis=None)
>>> np.column_stack(np.unravel_index(sort_idx[::-1], a.shape))
array([[13, 62],
[26, 77],
[81, 4],
...,
[83, 40],
[17, 34],
[54, 91]], dtype=int64)
You first get an index that sorts the whole array, and then convert that flat index into pairs of indices with np.unravel_index. The call to np.column_stack simply joins the two arrays of coordinates into a single one, and could be replaced by the Python zip(*np.unravel_index(sort_idx[::-1], a.shape)) to get a list of tuples instead of an array.
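On a tiny array the round trip is easy to follow:
>>> import numpy as np
>>> a = np.array([[2, 7], [1, 4]])
>>> sort_idx = np.argsort(a, axis=None)
>>> list(zip(*np.unravel_index(sort_idx[::-1], a.shape)))
[(0, 1), (1, 1), (0, 0), (1, 0)]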
Try this:
>>> import numpy as np
>>> from operator import itemgetter
>>> a = np.array([[2, 7], [1, 4]])
>>> a
array([[2, 7],
       [1, 4]])
>>> sorted(np.ndenumerate(a), key=itemgetter(1), reverse=True)
[((0, 1), 7),
((1, 1), 4),
((0, 0), 2),
((1, 0), 1)]
You can iterate over this list if you so wish. Essentially, I am telling sorted to order the elements of np.ndenumerate(a) according to the key itemgetter(1). itemgetter(1) picks the second element (index 1) from each tuple ((0, 1), 7), ((1, 1), 4), ... generated by np.ndenumerate(a), i.e. the values.