I have a source DataFrame that I need to loop through: for each Name, I want to process all of its Comments values (grouped by the Name field) and append the result as a new column of the DataFrame. The result can go into a new DataFrame as well.
Input Data :
Name Comments
0 N-1 Good
1 N-2 bad
2 N-3 ugly
3 N-1 very very good
4 N-3 what is this
5 N-4 pathetic
6 N-1 needs improvement
7 N-2 this is not right
8 Ano-5 It is average
[9 rows x 2 columns]
For example, for all Comments values of Name N-1, run a loop and add the output as a new column alongside the two existing values (Name, Comments).
I tried the following and was able to group by Name, but I am unable to run through all the Comments values in order to append the output:
gp = CommentsData.groupby(['Name'])   # group on the Name column shown above
for g in gp.groups.items():
    Data1 = CommentsData.loc[g[1]]    # all rows belonging to this group
    #print(Data1)
The data inside the group-by loop looks like:
Name Comments
0 N-1 good
3 N-1 very very good
6 N-1 needs improvement
1 N-2 bad
7 N-2 this is not right
I am unable to access the values in the 2nd column.
Using df.iloc[i] I am only able to access the first element, not all of them (the number of elements varies for different values of Name).
Now I want to use the values in Comments and add the output as an additional column in the DataFrame (it can be a new DF).
Expected Output:
Name Comments Result
0 N-1 Good A
1 N-2 bad B
2 N-3 ugly C
3 N-1 very very good A
4 N-3 what is this B
5 N-4 pathetic C
6 N-1 needs improvement C
7 N-2 this is not right B
8 Ano-5 It is average B
[9 rows x 3 columns]
You can use apply and reset_index:
df.groupby('Name').Comments.apply(pd.DataFrame.reset_index, drop=True).unstack()
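If the goal is to run your own logic over every comment of each group and write the result back as a new column, you can also loop over the groups directly. This is only a minimal sketch: compute_result is a hypothetical placeholder for whatever you want to do with each (Name, Comment) pair.
out = CommentsData.copy()
for name, group in CommentsData.groupby('Name'):
    # iterate over every comment of this Name, not just the first one
    for idx, comment in zip(group.index, group['Comments']):
        out.loc[idx, 'Result'] = compute_result(name, comment)  # hypothetical helper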
I searched around for something like this but couldn't find anything. I also struggled with the wording of my question, so bear with me here. I'm working with data provided in an Excel file and have Excel/VBA and Python available to use. I'm still learning both, and I just need a push in the right direction on which method to use so I can work through it and learn.
Say I have a series of 5 processes in a manufacturing facility, each process represented by a column. Occasionally a downstream process (column 5) gets backed up and slows down upstream processes (column 4, then 3, etc.). I have a 2D array that indicates running (0) or backed up (1). Each column is an individual process and each row is a time interval. The array will be the same size every time (10000+ rows, 5 columns), but the values change. Example:
MyArray = [0 0 0 0 0
0 1 0 0 0
1 1 0 0 1
1 0 0 0 1
0 0 0 1 1
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0
0 1 0 0 0
0 0 0 0 0]
Essentially, when there is a value of 1, I want to trace it to the upper-rightmost adjacent 1. So for column 1, rows 2 & 3 would be traced back to column 2, row 2. For column 2, rows 8 & 9 would be traced back to column 5, row 3. Currently I just have an if statement looking to the right within the same row; it's better than nothing, but it does not capture the cascade effect you get when something backs up multiple upstream processes.
I could figure out a way to do it by looping a search up a column until it finds a zero, then, if the next column's value is a 1, searching up until it finds a zero, and repeating. But this seems really inefficient and I feel like there has to be a better way to do it. Any thoughts or comments would be very helpful. Thanks.
One tip is to actually begin from the bottom row (let's call this row r=0) and the first column c=1, and search rightward until you hit column c=5. If no 1s are found, repeat the search for the second-to-bottom row, and continue until you have searched the top row.
Whenever a 1 is found at some index (r, c), recursively ask the question "does this element have all zeros above (r+1, c), to the right (r, c+1), and above-and-to-the-right (r+1, c+1)?" Now, I'm not sure what you do in the case of a "tie", i.e. when there is a 1 above at (r+1, c) and another 1 to the right at (r, c+1) but a 0 at (r+1, c+1), but you could address this competition in a simple if statement.
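A minimal sketch of that trace idea, indexing rows top-to-bottom exactly as MyArray is written (so "up" means a smaller row index) and preferring the diagonal neighbour, then the right one, then the one above; the tie-breaking order is an assumption you may want to change:
def trace(grid, r, c):
    # follow adjacent 1s up and to the right until none remain
    n_rows, n_cols = len(grid), len(grid[0])
    for dr, dc in ((-1, 1), (0, 1), (-1, 0)):   # diagonal, right, up
        nr, nc = r + dr, c + dc
        if 0 <= nr < n_rows and 0 <= nc < n_cols and grid[nr][nc] == 1:
            return trace(grid, nr, nc)
    return r, c   # no adjacent 1 further up/right: this is the source

MyArray = [[0, 0, 0, 0, 0],
           [0, 1, 0, 0, 0],
           [1, 1, 0, 0, 1],
           [1, 0, 0, 0, 1],
           [0, 0, 0, 1, 1],
           [0, 0, 1, 1, 1],
           [0, 0, 1, 1, 0],
           [0, 1, 1, 0, 0],
           [0, 1, 0, 0, 0],
           [0, 0, 0, 0, 0]]

print(trace(MyArray, 2, 0))   # -> (1, 1), the 1 in the second row, second column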
The following question is a generalization of the question posted here:
Counting the intersection of equivalent rows in two tables
I have two FITS files. For example, the first file has 100 rows and 2 columns. The second file has 1000 rows and 3 columns.
FITS FILE 1        FITS FILE 2
A  B               C  D  E
1  2               1  2  0.1
1  3               1  2  0.3
2  4               1  2  0.9
I need to take the first row of the first file, i.e. 1 and 2, and check how many rows in the second file have C = 1 and D = 2, weighting each pair (C, D) with respect to the corresponding value in column E.
In the example, I have 3 rows in the second file that have C = 1 and D = 2. They have weights E = 0.1, 0.3, and 0.9, respectively. Weighting with respect to the values in E, I need to associate the value 0.1+0.3+0.9 = 1.3 to the pair (A,B) = (1,2) of the first file. Then, I need to do the same for the second row of the first file, i.e. 1 and 3, and find out how many rows in the second file have 1 and 3, again weighting with respect to the value in column E, and so on.
The first file does not have duplicates (all the rows have different pairs, none are identical, only file 2 has many identical pairs which I need to find).
In the end I need, for every row of the first FITS file, the weighted number of rows in the second file that have the same pair of values.
The result should be:
A B Number
1 2 1.3 # 1 and 2 occurs 1.3 times
1 3 4.5 # 1 and 3 occurs 4.5 times
and so on for all pairs in A and B columns.
I know from the post cited above that, when the weights in column E are all equal to 1, the solution involves Counter, as follows:
from collections import Counter
# Create frequency table of (C,D) column pairs
file2freq = Counter(zip(C,D))
# Look up frequency value for each row of file 1
for a,b in zip(A,B):
    # and print out the row and frequency data.
    print a,b,file2freq[a,b]
To answer the question I need to include the weights in E when I use Counter:
file2freq = Counter(zip(C,D))
I was wondering if it is possible to do that.
Thank you very much for your help!
I'd follow up on the suggestion made by Iguananaut in the comments to that question. I believe numpy is an ideal tool for this.
import numpy as np
fits1 = np.genfromtxt('fits1.csv')
fits2 = np.genfromtxt('fits2.csv')
summed = np.zeros(fits1.shape[0])
for ind, row in enumerate(fits1):
    condition = (fits2[:,:2] == row).all(axis=1)
    summed[ind] = fits2[condition,-1].sum()  # change the assignment operator to += if the rows in fits1 are not unique
After the import, the first 2 lines will load the data from the files. That will return an array of floats, which comes with the warning: comparing one float to another is prone to bugs. In this case it will work though, because both the columns in fits1.csv and the first 2 columns in fits2.csv are integers and parsed in the same manner by genfromtxt.
Then, in the for-loop the variable condition is created, which states that anytime the first two columns in fits2 match with the columns of the current row of fits1, it is to be taken into account (the result is a boolean array).
Then, finally, for the current row index ind, set the value of the array summed to the sum of all the values in column 3 of fits2, where the condition was True.
For a mini example I made, I got this:
oliver@armstrong:/tmp/sto$ cat fits1.csv
1 2
1 3
2 4
oliver@armstrong:/tmp/sto$ cat fits2.csv
1 2 .1
1 2 .3
1 2 .9
2 4 .3
1 5 .5
2 4 .7
# run the above code:
# summed is:
# array([ 1.3, 0. , 1. ])
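If you would rather stay close to the Counter idea from the linked answer, an ordinary dict (or defaultdict) can accumulate the weights instead of counting occurrences. A minimal sketch, assuming A, B, C, D, E are already loaded as column sequences as in the quoted code:
from collections import defaultdict

file2weight = defaultdict(float)
for c, d, e in zip(C, D, E):
    file2weight[c, d] += e          # sum the weights instead of counting pairs

for a, b in zip(A, B):
    print a, b, file2weight[a, b]   # 0.0 for pairs that never occur in file 2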
I am looking for the right approach to solve the following task (using Python):
I have a dataset which is a 2D matrix. Let's say:
1 2 3
5 4 7
8 3 9
0 7 2
From each row I need to pick one number which is not 0 (I can also make it NaN if that's easier).
I need to find the combination with the lowest total sum.
So far so easy. I take the lowest value of each row.
The solution would be:
1 x x
x 4 x
x 3 x
x x 2
Sum: 10
But: there is a variable minimum and maximum sum allowed for each column, so just choosing the minimum of each row may lead to an invalid combination.
Let's say the minimum is defined as 2 in this example and no maximum is defined. Then the solution would be:
1 x x
5 x x
x 3 x
x x 2
Sum: 11
I need to choose 5 in row two as otherwise column one would be below the minimum (2).
I could use brute force and test all possible combinations, but due to the amount of data that needs to be analyzed (the number of data sets, not the size of each data set), that's not feasible.
Is this a common problem with a known mathematical/statistical or other solution?
Thanks
Robert
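For illustration, the brute-force search mentioned above could look like the following. This is only a minimal sketch over the small example matrix, assuming the per-column minimum applies to every column's total; it enumerates every choice, which is exactly what becomes infeasible for many data sets:
import itertools

matrix = [[1, 2, 3],
          [5, 4, 7],
          [8, 3, 9],
          [0, 7, 2]]
col_min = 2                       # column minimum from the example, no maximum here

n_rows, n_cols = len(matrix), len(matrix[0])
best_total, best_choice = None, None

# try every way of picking one column index per row (3**4 = 81 combinations here)
for choice in itertools.product(range(n_cols), repeat=n_rows):
    picked = [matrix[r][c] for r, c in enumerate(choice)]
    if any(v == 0 for v in picked):              # zeros may not be picked
        continue
    col_sums = [0] * n_cols
    for c, v in zip(choice, picked):
        col_sums[c] += v
    if any(s < col_min for s in col_sums):       # enforce the per-column minimum
        continue
    if best_total is None or sum(picked) < best_total:
        best_total, best_choice = sum(picked), choice

print(best_total, best_choice)                   # 11, (0, 0, 1, 2) for this example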
I want to implement a calculation for a simple scenario:
the value is computed as the sum of the daily data over the previous N days (set N = 3 in the following example)
Dataframe df: (df.index is 'date')
date value
20140718 1
20140721 2
20140722 3
20140723 4
20140724 5
20140725 6
20140728 7
......
and I want to calculate something like:
date value new
20140718 1 0
20140721 2 0
20140722 3 0
20140723 4 6 (3+2+1)
20140724 5 9 (4+3+2)
20140725 6 12 (5+4+3)
20140728 7 15 (6+5+4)
......
Now I have done this using a for loop, like:
df['new'] = [0] * len(df)             # initialise the new column (not 'value')
for idx in df.index:
    loc = df.index.get_loc(idx)
    if (loc - N) >= 0:
        tmp = df.loc[df.index[loc - N]:df.index[loc - 1]]
        s = tmp['value'].sum()
    else:
        s = 0
    df.loc[idx, 'new'] = s
But when the DataFrame is long or the value of N is big, this calculation becomes very slow. How can I implement it faster, with a built-in function or some other approach?
Also, what if the scenario is more complex? Thanks.
Since you want the sum of the previous three values excluding the current one, you can use pd.rolling_apply over a window of four and sum up all but the last value.
new = pd.rolling_apply(df, 4, lambda x: sum(x[:-1]), min_periods=4)
This is the same as shifting afterwards with a window of three:
new = pd.rolling_apply(df, 3, sum, min_periods=3).shift()
Then
df["new"] = new["value"].fillna(0)
I have two files that contain two columns each. The first column is an integer. The second column is a linear coordinate. Not every coordinate is represented, and I would like to insert all coordinates that are missing. Below is an example from one file of my data:
3 0
1 10
1 100
2 1000
1 1000002
1 1000005
1 1000006
For this example, coordinates 1-9, 11-99, etc are missing but need to be inserted, and need to be given a count of zero (0).
3 0
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
0 9
1 10
........
With the full set of rows, I then need to add one (1) to every count (the first column). Finally, I would like to do a simple calculation (the ratio) between the corresponding rows of the first column in the two files. The ratios should be real numbers.
I'd like to be able to do this with Unix if possible, but I am somewhat familiar with Python scripting as well. Any help is greatly appreciated.
This should work with Python 2.3 onwards.
I assumed that your file is space delimited.
If you want values past 1000006, you will need to change the value of desired_range.
import csv
desired_range = 1000007
reader = csv.reader(open('fill_range_data.txt'), delimiter=' ')
data_map = dict()
for row in reader:
    frequency = int(row[0])
    value = int(row[1])
    data_map[value] = frequency
for i in range(desired_range):
    if i in data_map:
        print data_map[i], i
    else:
        print 0, i
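The remaining steps from the question (adding 1 to every count and taking the ratio between the two files) could be bolted on like this. A minimal sketch with hypothetical file names fill_range_data_a.txt and fill_range_data_b.txt standing in for your two files:
import csv

def load_filled(path, desired_range=1000007):
    counts = dict()
    for row in csv.reader(open(path), delimiter=' '):
        counts[int(row[1])] = int(row[0])
    # missing coordinates count as 0; then add 1 to every count
    return [counts.get(i, 0) + 1 for i in range(desired_range)]

a = load_filled('fill_range_data_a.txt')
b = load_filled('fill_range_data_b.txt')
for i in range(len(a)):
    print a[i] / float(b[i]), i   # ratio of the two incremented counts per coordinate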