I have a data frame containing hyponym and hypernym pairs extracted from StackOverflow posts. You can see an excerpt from it in the following:
0 1 2 3 4
linq query asmx web service THH 10 a linq query as an asmx web service
application bolt THH 1 my application is a bolt on data visualization...
area r time THH 1 the area of the square is r times
sql query syntax HTH 3 sql like query syntax
...
7379596 rows × 5 columns
The column 0 and the column 1 contain the hyponym and hyperonym parts of the phrases contained by the column 4. I would like to implement a filter based on statistical features, therefore I have to count all occurrences of the pairs (0, 1) columns together, all occurrences of the hyponym and hyperonym parts respectively. Pandas has a method called value_counts(), so counting the occurrences can be obtained by:
df.value_counts([0])
df.value_counts([1])
df.value_counts([0, 1])
This is nice, but the method resulted in a Pandas Series which has much fewer records than the original DataFrame, therefore, adding a new column like df[5] = df.value_counts([0, 1]) does not work.
I have found a workaround: I have created 3 Pandas Series for every occurrence type (pair, hyponym, hyperonym) and I have written a small loop to calculate a confidence score for every pair but as the original dataset is huge (more than 7 million records) this calculation is not an efficient way to do that (the calculation has not finished after 30 hours). So, the feasible and hopefully efficient solution would be using the Pandas applymap() for this purpose, but it is needed to attach columns containing the occurrences to the original DataFrame. So I would like a DataFrame like this one:
0 1 2 3 4 5 6 7
sql query anything anything a phrase 1000 800 500
sql query anything anything anotherphrase 1000 800 500
...
The column 5 is the occurences of the hyponym part (sql), the column 6 is the number of occurrences of the hyperonym part (query) and the column 7 is the occurrences of the pair (sql,
query). As you can see the pairs are the same but they are extracted from different phrases.
My question is how to do that? How can I attach occurrences as a new column to an existing DataFrame?
Here's a solution on how to map the value counts of the combination of two columns to a new column:
# Create an example DataFrame
df = pd.DataFrame({0: ["a", "a", "a", "b"], 1: ["c", "d", "d", "d"]})
# Count the paired occurrences in a new column
df["count"] = df.groupby([0,1])[0].transform('size')
Before editing, I had answered this question with a solution using value_counts and a merge. This original solution is slower and more complicated than the groupby:
# Put the value_counts in a new DataFrame, call them count
vcdf = pd.DataFrame(df[[0, 1]].value_counts(), columns=["count"])
# Merge the df with the vcs
merged = pd.merge(left=df, right=vcdf, left_on=[0, 1], right_index=True)
# Potentially sort index
merged = merged.sort_index()
The resulting DataFrame:
0 1 count
0 a c 1
1 a d 2
2 a d 2
3 b d 1
I am attempting to generate a dataframe (or series) based on another dataframe, selecting a different column from the first frame dependent on the row using another series. In the below simplified example, I want the frame1 values from 'a' for the first three rows, and 'b for the final two (the picked_values series).
frame1=pd.DataFrame(np.random.randn(10).reshape(5,2),index=range(5),columns=['a','b'])
picked_values=pd.Series(['a','a','a','b','b'])
Frame1
a b
0 0.283519 1.462209
1 -0.352342 1.254098
2 0.731701 0.236017
3 0.022217 -1.469342
4 0.386000 -0.706614
Trying to get to the series:
0 0.283519
1 -0.352342
2 0.731701
3 -1.469342
4 -0.706614
I was hoping values[picked_values] would work, but this ends up with five columns.
In the real-life example, picked_values is a lot larger and calculated.
Thank you for your time.
Use df.lookup
pd.Series(frame1.lookup(picked_values.index,picked_values))
0 0.283519
1 -0.352342
2 0.731701
3 -1.469342
4 -0.706614
dtype: float64
Here's a NumPy based approach using integer indexing and Series.searchsorted:
frame1.values[frame1.index, frame1.columns.searchsorted(picked_values.values)]
# array([0.22095278, 0.86200616, 1.88047197, 0.49816937, 0.10962954])
I am trying to find the rows, in a very large dataframe, with the highest mean.
Reason: I scan something with laser trackers and used a "higher" point as reference to where the scan starts. I am trying to find the object placed, through out my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column, and gives out columns with an index of type float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives out tis :
index Mean
0 0 4.407498
1 1 4.463597
2 2 4.611886
3 3 4.710751
4 4 4.742491
5 5 4.580945
But it only re-index (I have now 3 columns instead of two) and does nothing else. It still shows all rows.
Here is one way without using groupby
moy=base.sort_values('Mean').tail(1)
It looks as though your data is a string or single column with a space in between your two numbers. Suggest splitting the column into two and/or using something similar to below to set the index to your specific column of interest.
import pandas as pd
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter="\s+")
df = df.set_index("Index")
print(df)
Problem and what I want
I have a data file that comprises time series read asynchronously from multiple sensors. Basically for every data element in my file, I have a sensor ID and time at which it was read, but I do not always have all sensors for every time, and read times may not be evenly spaced. Something like:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
2,1,5 # skip some sensors for some time steps
0,2,6
2,2,7
2,3,8
1,5,9 # skip some time steps
2,5,10
Important note the actual time column is of datetime type.
What I want is to be able to zero-order hold (forward fill) values for every sensor for any time steps where that sensor does not exist, and either set to zero or back fill any sensors that are not read at the earliest time steps. What I want is a dataframe that looks like it was read from:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
1,1,2 # ID 1 hold value from time step 0
2,1,5
0,2,6
1,2,2 # ID 1 still holding
2,2,7
0,3,6 # ID 0 holding
1,3,2 # ID 1 still holding
2,3,8
0,5,6 # ID 0 still holding, can skip totally missing time steps
1,5,9 # ID 1 finally updates
2,5,10
Pandas attempts so far
I initialize my dataframe and set my indices:
df = pd.read_csv(filename, dtype=np.int)
df.set_index(['ID', 'time'], inplace=True)
I try to mess with things like:
filled = df.reindex(method='ffill')
or the like with various values passed to the index keyword argument like df.index, ['time'], etc. This always either throws an error because I passed an invalid keyword argument, or does nothing visible to the dataframe. I think it is not recognizing that the data I am looking for is "missing".
I also tried:
df.update(df.groupby(level=0).ffill())
or level=1 based on Multi-Indexed fillna in Pandas, but I get no visible change to the dataframe again, I think because I don't have anything currently where I want my values to go.
Numpy attempt so far
I have had some luck with numpy and non-integer indexing using something like:
data = [np.array(df.loc[level].data) for level in df.index.levels[0]]
shapes = [arr.shape for arr in data]
print(shapes)
# [(3,), (2,), (5,)]
data = [np.array([arr[i] for i in np.linspace(0, arr.shape[0]-1, num=max(shapes)[0])]) for arr in data]
print([arr.shape for arr in data])
# [(5,), (5,), (5,)]
But this has two problems:
It takes me out of the pandas world, and I now have to manually maintain my sensor IDs, time index, etc. along with my feature vector (the actual data column is not just one column but a ton of values from a sensor suite).
Given the number of columns and the size of the actual dataset, this is going to be clunky and inelegant to implement on my real example. I would prefer a way of doing it in pandas.
The application
Ultimately this is just the data-cleaning step for training recurrent neural network, where for each time step I will need to feed a feature vector that always has the same structure (one set of measurements for each sensor ID for each time step).
Thank you for your help!
Here is one way , by using reindex and category
df.time=df.time.astype('category',categories =[0,1,2,3,4,5])
new_df=df.groupby('time',as_index=False).apply(lambda x : x.set_index('ID').reindex([0,1,2])).reset_index()
new_df['data']=new_df.groupby('ID')['data'].ffill()
new_df.drop('time',1).rename(columns={'level_0':'time'})
Out[311]:
time ID data
0 0 0 1.0
1 0 1 2.0
2 0 2 3.0
3 1 0 4.0
4 1 1 2.0
5 1 2 5.0
6 2 0 6.0
7 2 1 2.0
8 2 2 7.0
9 3 0 6.0
10 3 1 2.0
11 3 2 8.0
12 4 0 6.0
13 4 1 2.0
14 4 2 8.0
15 5 0 6.0
16 5 1 9.0
17 5 2 10.0
You can have a dictionary of last readings for each sensors. You'll have to pick some initial value; the most logical choice is probably to back-fill the earliest reading to earlier times. Once you've populated your last_reading dictionary, you can just sort all the readings by time, update the dictionary for each reading, and then fill in rows according to the dictionay. So after you have your last_reading dictionary initialized:
last_time = readings[1][time]
for reading in readings:
if reading[time] > last_time:
for ID in ID_list:
df.loc[last_time,ID] = last_reading[ID]
last_time = reading[time]
last_reading[reading[ID]] = reading[data]
#the above for loop doesn't update for the last time
#so you'll have to handle that separately
for ID in ID_list:
df.loc[last_time,ID] = last_reading[ID]
last_time = reading[time]
This assumes that you have only one reading for each time/sensor pair, and that 'readings' a list of dictionaries sorted by time. It also assumes that df has the different sensors as columns and different times as index. Adjust the code as necessary if otherwise. You can also probably optimize it a bit more by updating a whole row at once instead of using a for loop, but I didn't want to deal with making sure I had the Pandas syntax right.
Looking at the application, though, you might want to have each cell in the dataframe be not a number but a tuple of last value and time it was read, so replace last_reading[reading[ID]] = reading[data] with
last_reading[reading[ID]] = [reading[data],reading[time]]. Your neural net can then decide how to weight data based on how old it is.
I got this to work with the following, which I think is pretty general for any case like this where the time index for which you want to fill values is the second in a multi-index with two indices:
# Remove duplicate time indices (happens some in the dataset, pandas freaks out).
df = df[~df.index.duplicated(keep='first')]
# Unstack the dataframe and fill values per serial number forward, backward.
df = df.unstack(level=0)
df.update(df.ffill()) # first ZOH forward
df.update(df.bfill()) # now back fill values that are not seen at the beginning
# Restack the dataframe and re-order the indices.
df = df.stack(level=1)
df = df.swaplevel()
This gets me what I want, although I would love to be able to keep the duplicate time entries if anybody knows of a good way to do this.
You could also use df.update(df.fillna(0)) instead of backfilling if starting unseen values at zero is preferable for a particular application.
I put the above code block in a function called clean_df that takes the dataframe as argument and returns the cleaned dataframe.
I want to update the mergeAllGB.Intensity columns NaN values with values from another dataframe where ID, weekday and hour are matching. I'm trying:
mergeAllGB.Intensity[mergeAllGB.Intensity.isnull()] = precip_hourly[precip_hourly.SId == mergeAllGB.SId & precip_hourly.Hour == mergeAllGB.Hour & precip_hourly.Weekday == mergeAllGB.Weekday].Intensity
However, this returns ValueError: Series lengths must match to compare. How could I do this?
Minimal example:
Inputs:
_______
mergeAllGB
SId Hour Weekday Intensity
1 12 5 NaN
2 5 6 3
precip_hourly
SId Hour Weekday Intensity
1 12 5 2
Desired output:
________
mergeAllGB
SId Hour Weekday Intensity
1 12 5 2
2 5 6 3
TL;DR this will (hopefully) work:
# Set the index to compare by
df = mergeAllGB.set_index(["SId", "Hour", "Weekday"])
fill_df = precip_hourly.set_index(["SId", "Hour", "Weekday"])
# Fill the nulls with the relevant values of intensity
df["Intensity"] = df.Intensity.fillna(fill_df.Intensity)
# Cancel the special indexes
mergeAllGB = df.reset_index()
Alternatively, the line before the last could be
df.loc[df.Intensity.isnull(), "Intensity"] = fill_df.Intensity
Assignment and comparison in pandas are done by index (which isn't shown in your example).
In the example, running precip_hourly.SId == mergeAllGB.SId results in ValueError: Can only compare identically-labeled Series objects. This is because we try to compare the two columns by value, but precip_hourly doesn't have a row with index 1 (default indexing starts at 0), so the comparison fails.
Even if we assume the comparison succeeded, the assignment stage is problematic.
Pandas tries to assign according to the index - but this doesn't have the intended meaning.
Luckily, we can use it for our own benefit - by setting the index to be ["SId", "Hour", "Weekday"], any comparison and assignments will be done with relation to this index, so running df.Intensity= fill_df.Intensity will assign to df.Intensity the values in fill_df.Intensity wherever the index match, that is, wherever they have the same ["SId", "Hour", "Weekday"].
In order to assign only to the places where the Intensity is NA, we need to filter first (or use fillna). Note that filter by df.Intensity[df.Intensity.isnull()] will work, but assignment to it will probably fail if you have several values with the same (SId, Hour, Weekday) values.