I've read the CSV file using pandas and have managed to print the 1st, 2nd, 3rd and 4th row of every 20 rows using .iloc.
import pandas as pd

Prem_results = pd.read_csv("../data sets analysis/prem/result.csv")
Prem_results.iloc[:320:20,:]
Prem_results.iloc[1:320:20,:]
Prem_results.iloc[2:320:20,:]
Prem_results.iloc[3:320:20,:]
Is there a way, using iloc, to print the first 4 rows of every 20 lines together rather than separately like I do now? Apologies if this is worded badly; I'm fairly new to both Python and pandas.
Using groupby.head:
import numpy as np

Prem_results.groupby(np.arange(len(Prem_results)) // 20).head(4)
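For a quick self-contained check, here is the same idea on a small synthetic frame (a dummy df stands in for Prem_results):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.arange(100)})

# Label every block of 20 consecutive rows with the same group number,
# then keep the first 4 rows of each block
print(df.groupby(np.arange(len(df)) // 20).head(4))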
You can concat slices together like this:
pd.concat([df[i::20] for i in range(4)]).sort_index()
MCVE:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.arange(1000)})
pd.concat([df[i::20] for i in range(4)]).sort_index().head(20)
Output:
col1
0 0
1 1
2 2
3 3
20 20
21 21
22 22
23 23
40 40
41 41
42 42
43 43
60 60
61 61
62 62
63 63
80 80
81 81
82 82
83 83
Start at 0 and take every 20th row,
start at 1 and take every 20th row,
start at 2 and take every 20th row,
and start at 3 and take every 20th row.
You can also do this while reading the csv itself.
df = pd.DataFrame()
for chunk in pd.read_csv(file_name, chunksize=20):
    df = pd.concat((df, chunk.head(4)))
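As a side note, concatenating inside the loop copies the accumulated frame on every iteration; collecting the pieces first and concatenating once is usually cheaper. A sketch, reusing the same file_name:

import pandas as pd

pieces = [chunk.head(4) for chunk in pd.read_csv(file_name, chunksize=20)]
df = pd.concat(pieces)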
More resources:
You can read more about the usage of chunksize in the official pandas documentation here.
I also have a post about its usage here.
I have a CSV file where the data for one ID is spread across many rows. I want to merge all the data for one ID into one row by increasing the number of columns.
Id X Y
ABC 56 23
ABC 77 74
XYX 11 51
to
Id X Y X Y
ABC 56 23 77 74
XYX 11 51
How can I do it?
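One possible approach (a sketch assuming pandas; the _0/_1 column suffixes are my own naming) is to number each row within its Id group with cumcount and then unstack:

import pandas as pd

df = pd.DataFrame({'Id': ['ABC', 'ABC', 'XYX'],
                   'X': [56, 77, 11],
                   'Y': [23, 74, 51]})

# Number each occurrence of an Id, then pivot so every occurrence
# gets its own set of X/Y columns
df['n'] = df.groupby('Id').cumcount()
wide = df.set_index(['Id', 'n']).unstack('n')
wide.columns = [f'{col}_{n}' for col, n in wide.columns]
print(wide.reset_index())

Missing occurrences (XYX only has one row) come out as NaN.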
I have a dataframe (df_input), and I'm trying to convert it to another dataframe (df_output) by applying a formula to each element in each row. The formula requires information about the whole row (min, max, median).
df_input:
A B C D E F G H I J
2011-01-01 60 48 26 29 41 91 93 87 39 65
2011-01-02 88 52 24 99 1 27 12 26 64 87
2011-01-03 13 1 38 60 8 50 59 1 3 76
df_output:
F(A) F(B) F(C) F(D) F(E) F(F) F(G) F(H) F(I) F(J)
2011-01-01 93 54 45 52 8 94 65 37 2 53
2011-01-02 60 44 94 62 78 77 37 97 98 76
2011-01-03 53 58 16 63 60 9 31 44 79 35
I'm trying to go from df_input to df_output, as above, by applying f(x) to each cell per row. The function foo maps element x to f(x) by doing an OLS regression of the min, median and max of the row against some co-ordinates. This is done for each period.
I'm aware that I can iterate over the rows and then, for each row, apply the function to each element. Where I am struggling is getting the output of foo into df_output.
for index, row in df_input.iterrows():
    min = row.min()
    max = row.max()
    mean = row.mean()
    # apply function to row
    new_row = row.apply(lambda x: foo(x, min, max, mean))
    # add this to df_output
Help! My current thinking is to build up the new df row by row. I'm trying to do that but I'm getting a lot of MultiIndex columns etc. Any pointers would be great.
Thanks so much... merry Xmas to you all.
Consider calculating the row-wise aggregates with DataFrame methods first and then passing those Series into a DataFrame.apply() across the columns:
# ROW-WISE AGGREGATES (computed on the data columns only, so the
# helper columns added here don't skew the later aggregates)
cols = list('ABCDEFGHIJ')
df['row_min'] = df[cols].min(axis=1)
df['row_max'] = df[cols].max(axis=1)
df['row_mean'] = df[cols].mean(axis=1)

# COLUMN-WISE CALCULATION (DEFAULT axis=0)
new_df = df[cols].apply(lambda col: foo(col,
                                        df['row_min'],
                                        df['row_max'],
                                        df['row_mean']))
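As a self-contained illustration of the mechanics (foo here is only a placeholder min-max scaling, not the OLS-based function from the question):

import numpy as np
import pandas as pd

def foo(col, row_min, row_max, row_mean):
    # placeholder transformation; works element-wise on aligned Series
    return (col - row_min) / (row_max - row_min)

cols = list('ABCDEFGHIJ')
df = pd.DataFrame(np.random.randint(0, 100, (3, 10)), columns=cols,
                  index=pd.date_range('2011-01-01', periods=3))

df['row_min'] = df[cols].min(axis=1)
df['row_max'] = df[cols].max(axis=1)
df['row_mean'] = df[cols].mean(axis=1)

new_df = df[cols].apply(lambda col: foo(col, df['row_min'],
                                        df['row_max'], df['row_mean']))
print(new_df)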
How do I parse this type of table?
https://primes.utm.edu/lists/small/10000.txt
The First 10,000 Primes
(the 10,000th is 104,729)
For more information on primes see http://primes.utm.edu/
2 3 5 7 11 13 17 19 23 29
31 37 41 43 47 53 59 61 67 71
73 79 83 89 97 101 103 107 109 113
These are not comma-separated or XML-structured numbers. Do you know any way to, say, read them into a list?
You can parse the structure of the table just by knowing your data starts at the fourth line and ends one line before the end. Furthermore, the whole table has integer contents. For example:
# Using the requests HTTP client library
import requests

# Get the raw text via an HTTP request
data = requests.get("http://primes.utm.edu/lists/small/10000.txt").text

# Nested list comprehension: split the data into lines, keep only the lines
# holding the table body, then split each of those lines into columns and
# convert every column to an integer.
primes = [[int(e) for e in line.strip().split()] for line in data.split('\n')[4:-2]]
Voilà.
This works because split() without arguments splits on any whitespace, such as tabs or groups of spaces.
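If you want a single flat list of primes rather than a list of rows, one more comprehension does it (same slicing assumption as above):

flat_primes = [int(e) for line in data.split('\n')[4:-2] for e in line.split()]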
See the data matrix below, taken from sensors; just integer numbers, nothing special.
A B C D E F G H I J K
1 25 0 25 66 41 47 40 12 69 76 1
2 17 23 73 97 99 39 84 26 0 44 45
3 34 15 55 4 77 2 96 92 22 18 71
4 85 4 71 99 66 42 28 41 27 39 75
5 65 27 28 95 82 56 23 44 97 42 38
…
10 95 13 4 10 50 78 4 52 51 86 20
11 71 12 32 9 2 41 41 23 31 70
12 54 31 68 78 55 19 56 99 67 34 94
13 47 68 79 66 10 23 67 42 16 11 96
14 25 12 88 45 71 87 53 21 96 34 41
The horizontal axis, A to K, gives the sensor names, and the vertical axis is the data from each sensor over time.
Now I want to analyze this data with trial-and-error methods. I defined some concepts to explain what I want:
o source
source is all the raw data I get
o entry
an entry is one set of all the A to K sensor values; taking the first row for example, the entry is
25 0 25 66 41 47 40 12 69 76 1
o rules
a rule is a "suppose" function that returns an assertion value, so far just "true" or "false".
For example, I suppose the values of sensors A, E and F will never be the same in one entry; if an entry has A=E=F, it triggers a violation and the rule function returns false.
o range
a range is a function for selecting entries vertically, for example the first 5 entries
Then, the basic idea is:
o source + range = subsource(s)
o subsource + rules = validation(s)
Finally, I want to get a list that may look like this:
rangeID ruleID violation
1 1 Y
2 1 N
3 1 Y
1 2 N
2 2 N
3 2 Y
1 3 N
2 3 Y
3 3 Y
But the problem is that the rules and ranges defined here will get very complicated very quickly if you look deeper; there are too many possible combinations. Take "A=E=F" for example: one can also define "B=E=F", "C=E=F", "C>F" ......
So soon I will need a rule/range generator which accepts those "core parameters", such as "A=E=F", as input, perhaps even as regex strings later. That is so complicated it has defeated me, let alone that I may also need to persist unique rule IDs, solve data storage problems, handle nested rule combinations ......
So my questions are:
Does anyone know of a module/software suited to this kind of trial-and-error calculation or to the rule definitions I want?
Can anyone share a better rule/range design than the one I described?
Thanks for any hints.
Rgs,
KC
If I understand what you're asking correctly, I probably wouldn't even venture down the NumPy path, as I don't think, given your description, that it's really required. Here's a sample implementation of how I might go about solving the specific issue you presented:
l = [
    {'a': 25, 'b': 0, 'c': 25, 'd': 66, 'e': 41, 'f': 47, 'g': 40, 'h': 12, 'i': 69, 'j': 76, 'k': 1},
    {'a': 25, 'b': 0, 'c': 25, 'd': 66, 'e': 41, 'f': 47, 'g': 40, 'h': 12, 'i': 69, 'j': 76, 'k': 1},
]
r = ['a=g=i', 'a=b', 'a=c']
res = []

# test all given rules
for n in range(len(r)):
    # I'm assuming equality here - you'd have to change this to accept other operators if needed
    c = r[n].split('=')
    vals = []
    # build up a list of values given our current rule
    for e in c:
        vals.append(l[0][e])
    # len(set(vals)) gives the number of distinct values; 1 means all equal, i.e. a violation
    res.append({'rangeID': 0, 'ruleID': n, 'violation': 'Y' if len(set(vals)) == 1 else 'N'})

print(res)
Output:
[{'rangeID': 0, 'ruleID': 0, 'violation': 'N'}, {'rangeID': 0, 'ruleID': 1, 'violation': 'N'}, {'rangeID': 0, 'ruleID': 2, 'violation': 'Y'}]
http://ideone.com/zbTZr
There are a few assumptions made here (such as equality being the only operator in use in your rules) and some functionality is left out (such as parsing your input into the list of dicts I used), but I'm hopeful that you can figure that out on your own.
Of course, there could be a Numpy-based solution that's simpler than this that I'm just not thinking of at the moment (it's late and I'm going to bed now ;)), but hopefully this helps you out anyway.
Edit:
Whoops, I missed something else (forgot to add it before posting): I only test the first element in l (the given range). You'd just want to stick that in another for loop rather than using that hard-coded 0 index.
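A rough sketch of that extension, reusing l and r from above and chunking the entries into fixed-size ranges (the range size of 5 is my own choice):

range_size = 5
ranges = [l[i:i + range_size] for i in range(0, len(l), range_size)]

res = []
for range_id, entries in enumerate(ranges):
    for rule_id, rule in enumerate(r):
        keys = rule.split('=')
        # the range violates the rule if any of its entries has all-equal values
        violated = any(len({entry[k] for k in keys}) == 1 for entry in entries)
        res.append({'rangeID': range_id, 'ruleID': rule_id,
                    'violation': 'Y' if violated else 'N'})
print(res)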
You want to look at NumPy arrays/matrices for data structures like this; NumPy exposes a set of functions for matrix manipulation.
As for the rule/range generator, I am afraid you will have to build your own domain-specific language to achieve that.
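Before committing to a full DSL, a lighter-weight option (a sketch under my own assumptions about how rules and ranges might be represented) is to keep each rule as a plain Python callable over a NumPy row:

import numpy as np

# Each rule takes one entry (a 1-D array of the A..K readings, columns 0..10)
# and returns True when the rule is violated
rules = {
    'A=E=F': lambda e: e[0] == e[4] == e[5],
    'C>F':   lambda e: e[2] > e[5],
}

data = np.random.randint(0, 100, size=(14, 11))   # rows = entries, columns = sensors A..K
ranges = {'first 5': data[:5], 'last 5': data[-5:]}

for range_id, sub in ranges.items():
    for rule_id, rule in rules.items():
        violated = any(bool(rule(entry)) for entry in sub)
        print(range_id, rule_id, 'Y' if violated else 'N')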