Find position in a csv list - python

Say I have the following CSV file:
1.41, 123456
1.42, 123456
1.43, 123456
and I want to find the "position"/location of a value in column 0 (row[0] of each row), i.e. "1.41, 1.42, 1.43" in this case, depending on whether the particular value is greater than or equal to an arbitrary input value.
So for example, if the input value was 1.42, we would return positions 0 and 1; if the input value was 1.4, we would return 0, 1 and 2. Likewise, if the value was 1.45 we would not return any positions. Here is what I have:
import csv

out = open("test.csv", "rU")
dataf = csv.reader(out, delimiter=',')
for row in dataf:
    x = row[0]
    for i in xrange(len(x)):
        if x[i] >= 1:
            print i
only to get,
0
1
2
3
0
1
2
3
0
1
2
3
so then I use
for i in xrange(len(x)):
    if x[i] >= 2:
        print i
But I still get the same position values. Can anyone steer me in the right direction?

From what I can gather, this does what you're asking...
#!/usr/bin/env python
import csv

value = 1.42
out = open("test.csv", "rU")
dataf = csv.reader(out, delimiter=',')
matches = [float(row[0]) >= value for row in dataf]
matches.reverse()
for i, m in enumerate(matches):
    if m:
        print i
matches is a list of boolean values that shows whether the first column in each row is greater than or equal to value. It looks like you want to order from the bottom up, so I reversed the list. The loop prints the index in the (reversed) list wherever the value in the first column was greater than or equal to value.
value = 1.42
output:
0
1
value = 1.4
output:
0
1
2
value = 1.45
no output.
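For reference, a Python 3 version of the same approach might look like this (a sketch, assuming the same test.csv; a with block and print() replace the Python 2 idioms above):

import csv

value = 1.42
with open("test.csv", newline='') as f:
    # matches[i] is True when the first column of row i is >= value
    matches = [float(row[0]) >= value for row in csv.reader(f)]

# count positions from the bottom of the file, as in the answer above
matches.reverse()
for i, m in enumerate(matches):
    if m:
        print(i)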

Related

Find a Collection of Indexes Provided that the Value

So, I have data like this:
Index c1
sls1 6
sls2 4
sls3 7
sls4 5
sls5 5
I want to find collections of indexes such that the values of column c1 over the indexes in each collection sum to less than or equal to 10, using a loop. Then I save each index set as a list in a new data frame, which is the output:
output = []
output
[sls1, sls2]
[sls3]
[sls4, sls5]
The first row is [sls1, sls2] because the values at those two indexes sum to exactly 10, while the second row is [sls3] alone because the value of column c1 at index sls3 is 7, and adding the next index's value would push the sum over 10. And so on.
Thank You
There is no vectorized way to compute a cumulative sum that restarts at a threshold, so you'll have to use a loop.
Then combine this with groupby.agg:
import pandas as pd

def group(s, thresh=10):
    out = []
    g = 0
    curr_sum = 0
    for v in s:
        curr_sum += v
        # start a new group whenever the running sum exceeds the threshold
        if curr_sum > thresh:
            g += 1
            curr_sum = v
        out.append(g)
    return pd.Series(out, index=s.index)

out = df.groupby(group(df['c1']))['Index'].agg(list)
Output:
0 [sls1, sls2]
1 [sls3]
2 [sls4, sls5]
Name: Index, dtype: object
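For reference, a minimal way to reproduce this on the sample data (a sketch; the frame is built by hand to match the table above):

import pandas as pd

df = pd.DataFrame({'Index': ['sls1', 'sls2', 'sls3', 'sls4', 'sls5'],
                   'c1': [6, 4, 7, 5, 5]})

out = df.groupby(group(df['c1']))['Index'].agg(list)
print(out)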

How to judge whether the difference between any consecutive rows is greater than 0.1 in a dataframe?

The data in my csv like this:
staff_id,clock_time,device_id,latitude,longitude
1001,2020/9/14 4:43:00,d_1,24.59652556,118.0824644
1001,2020/9/14 8:34:40,d_1,24.59732974,118.0859631
1001,2020/9/14 3:33:34,d_1,24.73208312,118.0957197
1001,2020/9/14 4:17:29,d_1,24.59222786,118.0955275
1001,2020/9/20 5:30:56,d_1,24.59689407,118.2863806
1001,2020/9/20 7:26:05,d_1,24.58237852,118.2858955
I want to find any rows where the difference between the longitude or latitude of 2 consecutive rows is greater than 0.1, then put the row indexes of the two consecutive rows into a list.
From my data, the latitude differences between rows 2 (24.59732974), 3 (24.73208312) and 4 (24.59222786) are greater than 0.1, and the longitude difference between rows 4 (118.0955275) and 5 (118.2863806) is greater than 0.1.
I want to put the indexes of rows 2, 3, 4 into a list latitude_diff_list, and put the indexes of rows 4, 5 into another list longitude_diff_list. What should I do?
You need to use a combination of diff(), to check if the absolute difference with the next or the previous row is more than 0.1, and then get the indices of these rows (I understand you actually want the index, not the descriptive row number, i.e. an index that starts from 0). One way you could do this is:
latitude_diff_list = df.index[(abs(df['latitude'].diff()) > 0.1) | (abs(df['latitude'].diff(-1)) > 0.1)].tolist()
longitude_diff_list = df.index[(abs(df['longitude'].diff()) > 0.1) | (abs(df['longitude'].diff(-1)) > 0.1)].tolist()
You can then offset this by +1 if you want the row number starting from 1 (e.g. [i+1 for i in latitude_diff_list])
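Putting that together, a self-contained sketch (the filename staff.csv is an assumption):

import pandas as pd

df = pd.read_csv('staff.csv')  # the sample data above

# flag a row when its absolute difference to either neighbour exceeds 0.1
lat_mask = (df['latitude'].diff().abs() > 0.1) | (df['latitude'].diff(-1).abs() > 0.1)
lon_mask = (df['longitude'].diff().abs() > 0.1) | (df['longitude'].diff(-1).abs() > 0.1)

latitude_diff_list = df.index[lat_mask].tolist()   # [1, 2, 3] with the default 0-based index
longitude_diff_list = df.index[lon_mask].tolist()  # [3, 4]

# offset by +1 for 1-based row numbers, giving [2, 3, 4] and [4, 5]
latitude_diff_list = [i + 1 for i in latitude_diff_list]
longitude_diff_list = [i + 1 for i in longitude_diff_list]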
I believe you need the absolute difference between the original and shifted values, compared with DataFrame.gt (greater than):
m1 = df[['latitude','longitude']].diff().abs().gt(0.1)
m2 = df[['latitude','longitude']].shift().diff().abs().gt(0.1)
m = m1 | m2
print (m)
latitude longitude
0 False False
1 False False
2 True False
3 True False
4 True True
5 False True
latitude_diff_list = df.index[m['latitude']].tolist()
print (latitude_diff_list)
[2, 3, 4]
longitude_diff_list = df.index[m['longitude']].tolist()
print (longitude_diff_list)
[4, 5]
This should work:
import pandas as pd

df_ex = pd.read_csv('ex.csv', sep=',')
latitude_diff_list, longitude_diff_list = [], []
for idx, row in df_ex[1:].iterrows():
    if abs(row['latitude'] - df_ex.loc[idx-1, 'latitude']) > 0.1:
        latitude_diff_list.extend([idx-1, idx])
    if abs(row['longitude'] - df_ex.loc[idx-1, 'longitude']) > 0.1:
        longitude_diff_list.extend([idx-1, idx])
# deduplicate the overlapping pairs; sorted() keeps the indexes in order
latitude_diff_list = sorted(set(latitude_diff_list))
longitude_diff_list = sorted(set(longitude_diff_list))

How to print the sum of columns and the index when the sum is not 0?

I have a pandas dataframe similar to this structure:
a b c
1 0 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 0 0 0
I want to know if the sum of each row is != 0, so I tried a for loop that iterates over the rows, sums each with the built-in .sum() function, and checks the condition.
The problem is that 99% of the data (>200,000 records) is filled with 0s, and my goal is to know which indexes have a sum > 0.
I've tried this:
for x in range(len(people_killed)):
    print("Checking row " + str(x))
    if people_killed.iloc[x].sum() == 0:
        people_killed = people_killed.drop(x, axis=0)
but it will take a long time to get through every row.
What would be the best way to do this?
Thanks a lot beforehand!
You can use sum and then find the nonzero indices as follows:
import numpy as np

np.flatnonzero(people_killed.sum(1))
#[0, 2]
people_killed[people_killed.apply(sum, axis = 1) != 0]
A brief note on the logic of this problem: you don't need to sum every element in a row. If all the numbers are non-negative, you only need to find a single number greater than 0, i.e. stop iterating over a row as soon as you hit a positive value, because the row's sum then cannot be zero.
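The vectorized counterpart of that idea is a boolean mask built with any(); a minimal sketch, assuming all values are non-negative:

# keep only the rows that contain at least one positive value
people_killed[(people_killed > 0).any(axis=1)]

# or just the index labels of those rows
people_killed.index[(people_killed > 0).any(axis=1)]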
To answer your first question: How to print the sum of columns
(in each row), run:
people_killed.sum(axis=1)
The result is:
1 1
2 0
3 1
4 0
5 0
dtype: int64
The left column is the index and the right column holds the sums for each row.
And as your second question is concerned, note that:
people_killed.sum(axis=1).ne(0) generates a Series of bool,
answering the question: Has this row a non-zero sum?
people_killed[people_killed.sum(axis=1).ne(0)] retrieves all
rows with sum != 0 (an example of boolean indexing).
So to get your result only one addition is needed: Retrieve only the
index of these rows:
people_killed[people_killed.sum(axis=1).ne(0)].index
The result is Int64Index([1, 3], dtype='int64'), so it is a list
of index values of the "wanted" rows, not integer positions of these rows
(as the solution by Ehsan generates).
My solution computes just what you asked for: indices.
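A quick check on the sample frame shows the difference between the two results (a sketch; the frame is rebuilt by hand to match the table above):

import numpy as np
import pandas as pd

people_killed = pd.DataFrame({'a': [0, 0, 1, 0, 0],
                              'b': [1, 0, 0, 0, 0],
                              'c': [0, 0, 0, 0, 0]},
                             index=[1, 2, 3, 4, 5])

print(people_killed[people_killed.sum(axis=1).ne(0)].index)  # index labels: [1, 3]
print(np.flatnonzero(people_killed.sum(axis=1)))             # integer positions: [0 2]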

Select specific columns

I have a scientific dataframe:
radius date spin atom
0 12,50 YYYY/MM 0 he
1 11,23 YYYY/MM 2 c
2 45,2 YYYY/MM 1 z
3 11,1 YYYY/MM 1 p
I want to select, for each row, all the rows where the difference between the radii is under some value, for example 5.
I've defined a function for the calculation (simple, it's an example):
def diff_radius(a, b):
    return a - b
Is it possible, for each row, to find the rows that satisfy the condition by calling an external function?
I tried some things that didn't work:
for i in range(df.shape[0]):
    ...
    df_in_radius = df.apply(lambda x: diff_radius(df[i]['radius'], x['radius']))
Can you help me?
I am assuming that the datatype of the radius column is a tuple. You can keep the diff_radius method like:
def diff_radius(x):
    a, b = x
    return a - b
Then you can use the loc method in pandas to select the rows that match the condition of a radius difference less than 5:
df.loc[df.radius.apply(diff_radius) < 5]
Edit #1
If the datatype of the radius column is a string, then split it and typecast. The logic goes in the diff_radius method. In the string case:
def diff_radius(x):
    x_split = x.split(',')
    a, b = int(x_split[0]), int(x_split[-1])
    return a - b
I misspoke.
My dataframe is:
radius of my atom date spin atom
0 12.50 YYYY/MM 0 he
1 11.23 YYYY/MM 2 c
2 45.2 YYYY/MM 1 z
3 11.1 YYYY/MM 1 p
I loop over the rows, applying a special calculation to each row that satisfies the condition.
Example:
def diff_radius(current_row, x):
    return current_row - x

df = pd.read_csv(csvfile, delimiter=";", names=('radius', 'date', 'spin', 'atom'))
# for each row of the original dataframe
for i in range(df.shape[0]):
    # first build a new, temporary dataframe with the rows
    # whose radius is less than 5 away from df.iloc[i]['radius'] (the row at this loop level)
    df_tmp = df[diff_radius(df.iloc[i]['radius'], df['radius']) < 5]
    ...
    # start of the special calc, with df_tmp containing all the rows
    # less than 5 away from the current row (i)
I thank you sincerely for your answers
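A minimal sketch of that loop using boolean indexing (assumptions: "within 5" means the absolute difference of the radii, and the sample data above is rebuilt by hand):

import pandas as pd

df = pd.DataFrame({'radius': [12.50, 11.23, 45.2, 11.1],
                   'date': ['YYYY/MM'] * 4,
                   'spin': [0, 2, 1, 1],
                   'atom': ['he', 'c', 'z', 'p']})

for i in range(df.shape[0]):
    # all rows whose radius differs from row i's radius by less than 5
    df_tmp = df[(df['radius'] - df.iloc[i]['radius']).abs() < 5]
    # ... the special calculation on df_tmp goes here
    print(i, df_tmp['atom'].tolist())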

Counting the weighted intersection of equivalent rows in two tables

The following question is a generalization to the question posted here:
Counting the intersection of equivalent rows in two tables
I have two FITS files. For example, the first file has 100 rows and 2 columns. The second file has 1000 rows and 3 columns.
FITS FILE 1        FITS FILE 2
A B                C D E
1 2                1 2 0.1
1 3                1 2 0.3
2 4                1 2 0.9
I need to take the first row of the first file, i.e. 1 and 2, and check how many rows in the second file have C = 1 and D = 2, weighting each pair (C,D) by the corresponding value in column E.
In the example, I have 3 rows in the second file that have C = 1 and D = 2, with weights E = 0.1, 0.3, and 0.9, respectively. Weighting by the values in E, I need to associate the value 0.1+0.3+0.9 = 1.3 to the pair (A,B) = (1,2) of the first file. Then I need to do the same for the second row (of the first file), i.e. 1 and 3, and find out how many rows in the second file have 1 and 3, again weighting by the value in column E, and so on.
The first file does not have duplicates (all the rows have different pairs, none are identical, only file 2 has many identical pairs which I need to find).
I finally need the weighted number of rows in the second file that have the same values as each row of the first FITS file.
The result should be:
A B Number
1 2 1.3 # 1 and 2 occurs 1.3 times
1 3 4.5 # 1 and 3 occurs 4.5 times
and so on for all pairs in A and B columns.
I know from the post cited above that the solution for weights in column E all equal to 1 involves Counter, as follows:
from collections import Counter

# Create frequency table of (C,D) column pairs
file2freq = Counter(zip(C, D))

# Look up frequency value for each row of file 1
for a, b in zip(A, B):
    # and print out the row and frequency data.
    print a, b, file2freq[a, b]
To answer the question I need to include the weights in E when I use Counter:
file2freq = Counter(zip(C,D))
I was wondering if it is possible to do that.
Thank you very much for your help!
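For what it's worth, the weights can be accumulated with a plain defaultdict instead of Counter; a minimal sketch, assuming A, B, C, D, E are the columns as plain sequences:

from collections import defaultdict

# sum the weight E over every occurrence of each (C,D) pair in file 2
file2weight = defaultdict(float)
for c, d, e in zip(C, D, E):
    file2weight[c, d] += e

# look up the summed weight for each row of file 1 (missing pairs give 0.0)
for a, b in zip(A, B):
    print(a, b, file2weight[a, b])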
I'd follow up on the suggestion made by Iguananaut in the comments to that question. I believe numpy is an ideal tool for this.
import numpy as np

fits1 = np.genfromtxt('fits1.csv')
fits2 = np.genfromtxt('fits2.csv')

summed = np.zeros(fits1.shape[0])
for ind, row in enumerate(fits1):
    condition = (fits2[:, :2] == row).all(axis=1)
    # change the assignment operator to += if the rows in fits1 are not unique
    summed[ind] = fits2[condition, -1].sum()
After the import, the first 2 lines will load the data from the files. That will return an array of floats, which comes with the warning: comparing one float to another is prone to bugs. In this case it will work though, because both the columns in fits1.csv and the first 2 columns in fits2.csv are integers and parsed in the same manner by genfromtxt.
Then, in the for-loop the variable condition is created, which states that anytime the first two columns in fits2 match with the columns of the current row of fits1, it is to be taken into account (the result is a boolean array).
Then, finally, for the current row index ind, set the value of the array summed to the sum of all the values in column 3 of fits2, where the condition was True.
For a mini example I made, I got this:
oliver@armstrong:/tmp/sto$ cat fits1.csv
1 2
1 3
2 4
oliver@armstrong:/tmp/sto$ cat fits2.csv
1 2 .1
1 2 .3
1 2 .9
2 4 .3
1 5 .5
2 4 .7
# run the above code:
# summed is:
# array([ 1.3, 0. , 1. ])
