I am trying to drop multiple rows from my data.
I can drop rows using:
dt=dt.drop([40,41,42,43,44,45])
But I was wondering if there is a simpler way. I tried:
dt=dt.drop([40:45])
But sadly it did not work.
I would recommend np.r_:
df.drop(np.r_[40:50+1])
In case you want to drop two ranges at the same time:
np.r_[40:50+1,1:4+1]
Out[719]: array([40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 1, 2, 3, 4])
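For reference, a minimal runnable version matching the question's 40-45 range (dt here is a stand-in frame, and note that np.r_ requires numpy to be imported):
import numpy as np
import pandas as pd

dt = pd.DataFrame({'val': range(60)})  # stand-in for your data

# np.r_[40:45 + 1] expands to array([40, 41, 42, 43, 44, 45])
dt = dt.drop(np.r_[40:45 + 1])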
Assuming you want to drop a range of positions:
df.drop(df.index[40:46])
This doesn't assume the indices are integers.
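To see why: df.index[40:46] slices positions regardless of what the labels are. A small sketch with a string index (the frame is made up for illustration):
import pandas as pd

df = pd.DataFrame({'val': range(6)}, index=list('abcdef'))

# df.index[1:4] selects labels by position: Index(['b', 'c', 'd'])
df = df.drop(df.index[1:4])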
You can use:
dt = dt.drop(range(40,46))
or
dt.drop(range(40,46), inplace=True)
You could generate the list based on a range:
dt=dt.drop([x for x in range(40, 46)])
Or just:
dt=dt.drop(range(40, 46))
I have a dataframe from which I am trying to add attributes to my graph edges.
The dataframe has a mean_travel_time column, which is going to be the attribute for my edges.
In addition, I have a data list consisting of source and destination nodes as tuples, like this:
[(1160, 2399),
(47, 1005)]
Now, to add the attributes with set_edge_attributes, I need my data as a dictionary:
{(1160, 2399):1434.67,
(47, 1005):2286.10,
}
I did something like this:
data_dict={}#Empty dictionary
for i in data:
data_dict[i] = df1['mean_travel_time'].iloc[i]#adding values
But I am getting an error saying "too many indexers".
Can anyone help me out with this error?
Please provide your data in a format that is easy to copy:
df = pd.DataFrame({
'index': [1, 9, 12, 18, 26],
'sourceid': [1160, 70, 1190, 620, 1791],
'dstid': [2399, 1005, 4, 103, 1944],
'month': [1] * 5,
'mean_travel_time': [1434.67, 2286.10, 532.69, 593.20, 779.05]
})
If you are trying to iterate through a list of edges such as (1, 2), you need to set an index on your DataFrame first:
df1.set_index(['sourceid', 'dstid'])
You could then access specific edges:
df.set_index(['sourceid', 'dstid']).loc[(1160, 2399)]
Or use a list of edges:
edges = list(zip(df['sourceid'], df['dstid']))  # materialize the pairs; .loc does not accept a bare zip object
df.set_index(['sourceid', 'dstid']).loc[edges]
But you don't need to do any of this because, in fact, you can get your entire dict all in one go:
df.set_index(['sourceid', 'dstid'])['mean_travel_time'].to_dict()
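Putting it together with networkx, here is a minimal sketch; the frame df1 and its columns mirror the question, and set_edge_attributes (plural) is the actual networkx function name:
import networkx as nx
import pandas as pd

df1 = pd.DataFrame({
    'sourceid': [1160, 47],
    'dstid': [2399, 1005],
    'mean_travel_time': [1434.67, 2286.10],
})

G = nx.Graph()
G.add_edges_from(zip(df1['sourceid'], df1['dstid']))

# one dict, keyed by (source, destination) tuples
attrs = df1.set_index(['sourceid', 'dstid'])['mean_travel_time'].to_dict()
nx.set_edge_attributes(G, attrs, name='mean_travel_time')

print(G[1160][2399])  # {'mean_travel_time': 1434.67}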
This question already has answers here: How to plot keys and values from dictionary in histogram
I am analyzing chess games in Python and am trying to generate a histogram of the number of moves per game. The goal is to obtain something similar to this visualization, where there seem to be on average 70 moves per game:
Currently I have an unordered dict:
{'106': 38,
'100': 46,
'65': 58,
'57': 47,
'54': 31,
'112': 29,
'93': 35,
'91': 44,
...
'109': 35,
'51': 26}
where the keys denote the number of moves and the values denote the number of games.
I am finding it stupidly difficult to extract the dictionary data for plotting the histogram. I have tried extracting to a dataframe but am unable to get it in a matplotlib/seaborn readable format.
Does this work for you?
import matplotlib.pyplot as plt
d = {'106': 38, ...}  # renamed from dict to avoid shadowing the built-in
plt.bar([int(k) for k in d.keys()], d.values())
plt.show()
First, we sort your dictionary with a dict comprehension:
d = {'106': 38,
'100': 46,
'65': 58,
'57': 47,
'54': 31,
'112': 29,
'93': 35,
'91': 44,
'109': 35,
'51': 26}
sorted_d = {key: d[key] for key in sorted(d, key=int)}
And now we display it with a barplot:
import seaborn as sns
sns.barplot(x=list(sorted_d.keys()), y=list(sorted_d.values()))
This is the result you get: a barplot with the move counts in ascending order. If you don't sort the dictionary first, the bars appear in arbitrary key order instead.
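As an aside, since the dictionary is already aggregated (one count per move total), matplotlib's plt.hist can also reproduce a true histogram via its weights argument; a minimal sketch assuming the dict d from above:
import matplotlib.pyplot as plt

moves = [int(k) for k in d.keys()]  # number of moves per game
games = list(d.values())            # how many games had that many moves

# weights= lets pre-aggregated counts behave like raw observations
plt.hist(moves, weights=games, bins=20)
plt.xlabel('moves per game')
plt.ylabel('number of games')
plt.show()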
This question already has answers here: How to group dataframe rows into list in pandas groupby
I have a data frame from which I only want the rows that contain a certain value. I've already implemented that. What I want now is the item list grouped by user; what I get instead is every single element of the data frame in its own list. How do I get a list like [[user1_item1, ..., user1_itemN], ..., [userM_item1, ..., userM_itemN]]?
import pandas as pd

d = {'userid': [0, 0, 0, 1, 2, 2, 3, 3, 4, 4, 4],
'itemid': [715, 845, 98, 12324, 85, 715, 2112, 85, 2112, 852, 102]}
df = pd.DataFrame(data=d)
print(df)
users = df.loc[df.itemid == 715, "userid"]
df_new = df.loc[df.userid.isin(users)]
list_new = df_new[['itemid']].values.tolist()
# What I get
[[715],[845],[98],[85],[715]]
# What I want
[[715,845,98],[85,715]]
You may use a groupby operation:
list_new = df_new.groupby("userid")['itemid'].apply(list).tolist()
print(list_new) # [[715, 845, 98], [85, 715]]
The intermediate operation is
list_new = df_new.groupby("userid")['itemid'].apply(list)
print(list_new)
userid
0 [715, 845, 98]
2 [85, 715]
Name: itemid, dtype: object
If you want to do all of your code in one line, you can use list comprehension:
[x for x in [*df.groupby('userid')['itemid'].apply(list)] if 715 in x]
[[715, 845, 98], [85, 715]]
The code:
[*df.groupby('userid')['itemid'].apply(list)]
is equivalent to
df_new.groupby("userid")['itemid'].apply(list).tolist()
and the remaining part just loops through the sublists produced by that expression to check whether 715 is in any of them; x is each sublist in the code above.
We need to group our data by user id. Grouping is important in many applications, for example in machine-learning preprocessing; a small sketch of this idea follows below.
Example: suppose our data is collected from sensors at stations located in various parts of a state, and we measure pressure and temperature at, say, Station-1, Station-2 and Station-3. In many practical scenarios we have missing values. If we use the entire dataset to fill them, we may not get good results, but if we fill each station's missing values from that station's own data, we can do much better, since conditions differ between stations yet are similar within one.
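A minimal sketch of that per-station imputation with groupby and transform (the station and temperature names are made up for illustration):
import numpy as np
import pandas as pd

# hypothetical sensor readings; names are illustrative only
sensors = pd.DataFrame({
    'station': ['S1', 'S1', 'S2', 'S2', 'S3', 'S3'],
    'temperature': [20.0, np.nan, 30.0, 31.0, 25.0, np.nan],
})

# fill each missing value with that station's own mean, not the global mean
sensors['temperature'] = (sensors.groupby('station')['temperature']
                                 .transform(lambda s: s.fillna(s.mean())))
Returning to the question: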
ans = df.groupby('userid')['itemid'].apply(list)
userid
0 [715, 845, 98]
1 [12324]
2 [85, 715]
3 [2112, 85]
4 [2112, 852, 102]
Name: itemid, dtype: object
Each row gives all of that user's itemids.
For the following array:
[[[11, 22, 33]]], [[[32, 12, 3]]], I want to extract the first row, and it should output 11, 22, 33. However, using the following code, I get [[11, 22, 33]]. How can I remove the double bracket?
df = pd.DataFrame([
[[[11, 22, 33]]],
[[[32, 12, 3]]]
], index=[1, 2], columns=['ColA'])
df[df.index == 1].ColA.item()
Expected output should be in the form of 11,22,33; without the bracket
Use .astype(str) and str.replace with the regex alternation operator (|), then iat to get the first value:
df['ColA'].astype(str).str.replace(r'\[|\]', '', regex=True).iat[0]
Output
'11, 22, 33'
Note that the type of your value has changed from list to string.
Or using native python functions str and replace:
str(df['ColA'].iat[0]).replace('[', '').replace(']', '')
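Alternatively, rather than mangling strings, you can peel the extra nesting off the list itself; a small sketch using the frame from the question:
import pandas as pd

df = pd.DataFrame([
    [[[11, 22, 33]]],
    [[[32, 12, 3]]]
], index=[1, 2], columns=['ColA'])

inner = df.loc[1, 'ColA'][0]        # the cell is [[11, 22, 33]]; [0] unwraps it
print(', '.join(map(str, inner)))   # 11, 22, 33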
I have a dataframe (df) whose column names are ["Home", "Season", "Date", "Consumption", "Temp"]. What I'm trying to do is perform calculations on this dataframe by "Home", "Season", "Temp" and "Consumption".
In[56]: df['Home'].unique().tolist()
Out[56]: [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
In[57]: df['Season'].unique().tolist()
Out[57]: ['Spring', 'Summer', 'Autumn', 'Winter']
Here is what is done so far:
series = {}
for i in df['Home'].unique().tolist():
    for j in df["Season"].unique().tolist():
        series[i, j] = df[(df["Home"] == i) & (df["Consumption"] >= 0) & (df["Season"] == j)]

for key, value in series.items():
    value["Corr"] = value["Temp"].corr(value["Consumption"])
Here is the dictionary of dataframes named series as the output of the loop.
What I expected from the last loop was a dictionary of dataframes with a new column, "Corr", holding the correlation between "Temp" and "Consumption" for each dataframe, but instead it gives a single dataframe for the last home in the iteration, i.e. 23.
The aim is simply to add a sixth column named "Corr" to every dataframe in the dictionary, containing the correlation between "Temp" and "Consumption". Can you help me with the above? I'm somehow missing the use of the keys in the last loop. Thanks in advance!
All of those loops are entirely unnecessary! Simply call:
df.groupby(['Home', 'Season'])[['Consumption', 'Temp']].corr()
(thanks #jezrael for the correction)
One of the answers on How to find the correlation between a group of values in a pandas dataframe column helped, avoiding all the unnecessary loops. Thanks @jezrael and @JoshFriedlander for suggesting the groupby method.
Posting the solution here:
df = df[df["Consumption"] >= 0]
corrs = (df[["Home", "Season", "Temp"]]).groupby(
["Home", "Season"]).corrwith(
df["Consumption"]).rename(
columns = {"Temp" : "Corr"}).reset_index()
df = pd.merge(df, corrs, how = "left", on = ["Home", "Season"])
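For comparison, a minimal sketch of the same per-group correlation using groupby().apply instead of corrwith, assuming the question's column names:
# alternative: compute one scalar correlation per (Home, Season) group
df = df[df["Consumption"] >= 0]
corr = (df.groupby(["Home", "Season"])[["Temp", "Consumption"]]
          .apply(lambda g: g["Temp"].corr(g["Consumption"]))
          .rename("Corr")
          .reset_index())
df = df.merge(corr, on=["Home", "Season"], how="left")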