import numpy as np

def data_verify(source):
    rows = [x.strip().split(' ') for x in open(source)]
    columns = list(zip(*rows))
    blocks = np.array(rows).reshape((3, 3, 3, 3)).transpose((0, 2, 1, 3)).reshape((9, 9))
    # check iff the row lengths match the column lengths, see further
    return rows, columns, blocks
    else:
        return False
I've got a sudoku grid in a txt file, like this:
3 2 7 4 8 1 6 5 9
1 8 9 3 6 5 7 2 4
6 5 4 2 7 9 8 1 3
7 9 8 1 3 2 5 4 6
5 6 3 9 4 7 2 8 1
2 4 1 6 5 8 3 9 7
8 1 2 7 9 3 4 6 5
4 7 5 8 1 6 9 3 2
9 3 6 5 2 4 1 7 8
The function collects all relevant data and should return the respective rows, columns and blocks iff the length of the rows is the same as the columns' (I've got a few other functions that determine whether the puzzle is legit). I figured it is enough to compare the first row to all the columns (or vice versa, it doesn't make a difference). How can I create a check that goes something like:
for i in range(len(rows)):
    if len(rows[0]) == len(columns[i]):
        # do something only if all of the lengths check out
Use all:
if all(len(rows[0]) == len(columns[i]) for i in range(len(rows))):
    # do something only if all of the lengths check out
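Plugged back into data_verify, a minimal sketch (assuming the other validation functions you mention handle the rest) could look like:
import numpy as np

def data_verify(source):
    rows = [x.strip().split(' ') for x in open(source)]
    columns = list(zip(*rows))
    # return the data only if every column has the same length as the first row
    if all(len(rows[0]) == len(col) for col in columns):
        blocks = np.array(rows).reshape((3, 3, 3, 3)).transpose((0, 2, 1, 3)).reshape((9, 9))
        return rows, columns, blocks
    else:
        return False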
You can run the check in a for loop and set a flag if there is any mismatch; this example checks every row against every column:
match = True
for r in rows:
    for c in columns:
        if len(c) != len(r):
            match = False
# Only continue if match == True
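With the same names, the nested loop collapses to a single expression that also stops at the first mismatch:
match = not any(len(c) != len(r) for r in rows for c in columns)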
I need to go through a large DataFrame and select consecutive rows that share a value in a column. In the DataFrame below, looking at column x, I want to keep only the rows that are part of a run of consecutive values, and only for the values I specify, say 3 and 5.
col row x y
1 1 1 1
5 7 3 0
2 2 2 2
6 3 3 8
9 2 3 4
5 3 3 9
4 9 4 4
5 5 5 1
3 7 5 2
6 6 6 6
5 8 6 2
3 7 6 0
The results output would be:
col row x y consecutive-count
6 3 3 8 1
9 2 3 4 1
5 3 3 9 1
5 5 5 1 2
3 7 5 2 2
I tried
m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]
But that includes the consecutive 6 that I don't want.
I also tried:
df.query('x in [3, 5]')
But that returns every row where x is 3 or 5, whether consecutive or not.
IIUC, use masks for boolean indexing. Check for 3 or 5, and use a cummax and a reversed cummax to restrict the selection to the span between the first 3 and the last 5:
m1 = df['x'].eq(3)
m2 = df['x'].eq(5)
out = df[(m1|m2)&(m1.cummax()&m2[::-1].cummax())]
Output:
col row x y
2 6 3 3 8
3 9 2 3 4
4 5 3 3 9
6 5 5 5 1
7 3 7 5 2
You can create a group column for consecutive values, then filter by the group length and the value of x:
# create unique ids for consecutive groups, then get each group's length:
group_num = (df.x.shift() != df.x).cumsum()
group_len = group_num.groupby(group_num).transform("count")
# filter the main df (copy to avoid modifying a view):
df2 = df[(df.x.isin([3, 5])) & (group_len > 1)].copy()
# add the new group number column
df2['consecutive-count'] = (df2.x != df2.x.shift()).cumsum()
Output:
col row x y consecutive-count
3 6 3 3 8 1
4 9 2 3 4 1
5 5 3 3 9 1
7 5 5 5 1 2
8 3 7 5 2 2
I have a dataframe with a bunch of Q&A sessions. Each time the speaker changes, the dataframe gets a new row. I'm trying to assign question characteristics to the answers, so I want to create an ID for each question-answer group. In the example below, I want to increment the ID each time a new question is asked (speakertype_id == 3 marks questions; speakertype_id == 4 marks answers). I currently loop through the dataframe like so:
import pandas as pd

Q_A = pd.DataFrame({'qna_id': [9]*10,
                    'qnacomponentid': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                    'speakertype_id': [3, 4, 3, 4, 4, 4, 3, 4, 3, 4]})

group = [0]*len(Q_A)
j = 1
for index, row in enumerate(Q_A.itertuples()):
    if row[3] == 3:
        j += 1
    group[index] = j
Q_A['group'] = group
This gives me the desired output and is much faster than I expected, but this post makes me question whether I should ever iterate over a pandas dataframe. Any thoughts on a better method? Thanks.
Edit: Expected output:
qna_id qnacomponentid speakertype_id group
9 3 3 2
9 4 4 2
9 5 3 3
9 6 4 3
9 7 4 3
9 8 4 3
9 9 3 4
9 10 4 4
9 11 3 5
9 12 4 5
You can use eq and cumsum like this:
Q_A['gr2'] = Q_A['speakertype_id'].eq(3).cumsum()
print(Q_A)
qna_id qnacomponentid speakertype_id group gr2
0 9 3 3 2 1
1 9 4 4 2 1
2 9 5 3 3 2
3 9 6 4 3 2
4 9 7 4 3 2
5 9 8 4 3 2
6 9 9 3 4 3
7 9 10 4 4 3
8 9 11 3 5 4
9 9 12 4 5 4
Note: I'm not sure whether there is a reason the groups should start at 2, but you can add + 1 after the cumsum if that is a requirement.
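For example, matching the expected output's numbering (which starts at 2) would just be:
Q_A['group'] = Q_A['speakertype_id'].eq(3).cumsum() + 1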
I reproduced your expected output with:
Q_A['cumsum'] = Q_A[Q_A.speakertype_id!=Q_A.speakertype_id.shift()].groupby('speakertype_id').cumcount()+2
Q_A['cumsum'] = Q_A['cumsum'].ffill().astype('int')
I have a CSV file whose number of lines is a multiple of 16.
After reading it, I want to iterate over and inspect each group of 16 rows of data.
For example, the following file has lines in a multiple of 2:
1 2 4
4 5 6
4 5 7
3 4 7
6 7 1
3 1 8
then I want to divide these lines into 3 tables:
1 2 4
4 5 6

4 5 7
3 4 7

6 7 1
3 1 8
and iterate over each of these individual tables.
Thanks a lot
There are a lot of ways you can do this. One way is to use numpy to create the groupings and then use groupby to perform the iteration.
print(df)
a b c
0 1 2 4
1 4 5 6
2 4 5 7
3 3 4 7
4 6 7 1
5 3 1 8
import numpy as np

groups = np.arange(len(df)) // 2
for idx, subset in df.groupby(groups):
    print(subset)
    print("-" * 10)
# prints:
a b c
0 1 2 4
1 4 5 6
----------
a b c
2 4 5 7
3 3 4 7
----------
a b c
4 6 7 1
5 3 1 8
----------
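If you'd rather keep the pieces around instead of just printing them, the same grouping can be collected into a list of DataFrames (a small sketch reusing the groups array defined above):
tables = [subset for _, subset in df.groupby(groups)]
# tables[0], tables[1], tables[2] are the three 2-row DataFrames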
If you don't need the entire data at once and just want a specific number of rows at a time, consider reading the CSV file in chunks rather than reading it all in one go.
Something like this will work:
import pandas as pd

fileName = 'sample.csv'
batchSize = 16

for df in pd.read_csv(fileName, chunksize=batchSize):
    ...  # process the 16-row chunk here
I am trying to reorganize a .txt file containing a list of data with traits in the columns and the families in the rows. Basically, I need to write a program that creates rows comparing the people in each family, so that the traits of persons 1 and 2, 1 and 3, and 2 and 3 are compared, i.e.:
A 1 2 7 8 9 10
A 1 3 7 9 9 11
etc.
where A is the family, the first 2 numbers are the people compared, the 3rd and 4th numbers are trait 1 (e.g. a measurement for each person), and the final numbers are trait 2 (e.g. the BMI value for each person).
My input is like this:
A 1 trait trait
A 2 trait trait
A 3 trait trait
I was able to create a data frame using:
import pandas
data = pandas.read_csv('family.txt', sep=" ", header=None)
print(data)
I cannot seem to figure out an efficient way to concatenate the data into the rows needed above. Any help is greatly appreciated!
Thank you
OK, consider that your data is as follows:
A 1 7 4 5 6
A 2 6 5 4 7
A 3 7 7 5 4
B 1 7 4 5 6
B 2 6 5 4 7
B 3 7 7 5 4
Where the first column is the family and the second column is the person_id and all subsequent columns are traits.
Some super dirty and hastily written code below seems to give you what you want:
file_lines = []

def read_file():
    global file_lines
    with open("sample.txt", 'r') as fd:
        file_lines = fd.read().splitlines()
    print(file_lines)

def make_output():
    # compare every pair of lines belonging to the same family
    for line1 in file_lines:
        for line2 in file_lines:
            line1c = line1.split(" ")
            line2c = line2.split(" ")
            if line1c[0] == line2c[0]:
                if line1c[1] >= line2c[1]:
                    # skip self-comparisons and reversed duplicates
                    continue
                out_list = [line1c[0], line1c[1], line2c[1]]
                # interleave the traits of the two people, column by column
                for i in range(2, len(line1c)):
                    out_list.append(line1c[i])
                    out_list.append(line2c[i])
                print(" ".join(out_list))

read_file()
make_output()
With the original continue criterion (which only skipped self-comparisons), the output of the print was:
A 1 2 7 6 4 5 5 4 6 7
A 1 3 7 7 4 7 5 5 6 4
A 2 1 6 7 5 4 4 5 7 6
A 2 3 6 7 5 7 4 5 7 4
A 3 1 7 7 7 4 5 5 4 6
A 3 2 7 6 7 5 5 4 4 7
B 1 2 7 6 4 5 5 4 6 7
B 1 3 7 7 4 7 5 5 6 4
B 2 1 6 7 5 4 4 5 7 6
B 2 3 6 7 5 7 4 5 7 4
B 3 1 7 7 7 4 5 5 4 6
B 3 2 7 6 7 5 5 4 4 7
As you can see, in family A person 1 is compared with 2 and 3, person 2 with 1 and 3, and person 3 with 1 and 2.
Obviously there is duplication, because each person is compared with every other person in the family twice.
It's trivial to remove this by maintaining a list of who has been compared with whom.
P.S.: I know the script is really dirty, but I just wanted to illustrate what I've done, not write production code.
EDIT: I wanted to write a slightly more complicated duplicate remover, but since the data is so simple, a small modification to the continue criterion (the >= check shown above) solved it. The output after this edit is:
A 1 2 7 6 4 5 5 4 6 7
A 1 3 7 7 4 7 5 5 6 4
A 2 3 6 7 5 7 4 5 7 4
B 1 2 7 6 4 5 5 4 6 7
B 1 3 7 7 4 7 5 5 6 4
B 2 3 6 7 5 7 4 5 7 4
which is free of duplicates
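Since the question already loads the file with pandas, here's a hedged sketch of an equivalent approach using groupby and itertools.combinations; the column names are assumptions based on the layout described above (family, person id, then two traits):
import pandas as pd
from itertools import combinations

# assumed layout: family, person id, trait 1, trait 2
data = pd.read_csv('family.txt', sep=' ', header=None,
                   names=['family', 'person', 'trait1', 'trait2'])

rows = []
for family, grp in data.groupby('family'):
    # every unordered pair of people within one family
    for (_, p1), (_, p2) in combinations(grp.iterrows(), 2):
        rows.append([family, p1['person'], p2['person'],
                     p1['trait1'], p2['trait1'],
                     p1['trait2'], p2['trait2']])

pairs = pd.DataFrame(rows, columns=['family', 'person_a', 'person_b',
                                    'trait1_a', 'trait1_b',
                                    'trait2_a', 'trait2_b'])
print(pairs)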
I have a dataframe that contains number of observations per group of income:
INCAGG
1 6.561681e+08
3 9.712955e+08
5 1.658043e+09
7 1.710781e+09
9 2.356979e+09
I would like to compute the median income group. What do I mean?
Let's start with a simpler series:
INCAGG
1 6
3 9
5 16
7 17
9 23
It represents this set of numbers:
1 1 1 1 1 1
3 3 3 3 3 3 3 3 3
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
Which I can reorder to
1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7
7 7 7 7 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
which visually is what I mean - the median here would be 7.
After glancing at a numpy example here, I think cumsum() provides a good approach. Assuming your column of counts is called 'wt', here's a simple solution that will work most of the time (and see below for a more general solution):
df = df.sort_values('incagg')
df['tmp'] = df.wt.cumsum() < (df.wt.sum() / 2.)
df['med_grp'] = (df.tmp == False) & (df.tmp.shift() == True)
The second code line above is dividing into rows above and below the median. The median observation will be in the first False group.
incagg wt tmp med_grp
0 1 656168100 True False
1 3 971295500 True False
2 5 1658043000 True False
3 7 1710781000 False True
4 9 2356979000 False False
df.loc[df.med_grp, 'incagg']
3 7
Name: incagg, dtype: int64
This will work fine when the median is unique and often when it isn't. The problem can only occur if the median is non-unique AND it falls on the edge of a group. In this case (with 5 groups and weights in the millions/billions), it's really not a concern but nevertheless here's a more general solution:
df['tmp1'] = df.wt.cumsum() == (df.wt.sum() / 2.)
df['tmp2'] = df.wt.cumsum() < (df.wt.sum() / 2.)
df['med_grp'] = (df.tmp2==False) & (df.tmp2.shift()==True)
df['med_grp'] = df.med_grp | df.tmp1.shift()
incagg wt tmp1 tmp2 med_grp
0 1 1 False True False
1 3 1 False True False
2 5 1 True False True
3 7 2 False False True
4 9 1 False False False
df.loc[df.med_grp, 'incagg']
2 5
3 7
df.loc[df.med_grp, 'incagg'].mean()
6.0
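For what it's worth, the same cumsum idea can be folded into a small helper. This is just a sketch using the column names from above ('incagg' for the group and 'wt' for the counts), and like the simple version it returns a single group rather than averaging across a boundary in the non-unique edge case:
import pandas as pd

def weighted_median_group(df, value_col='incagg', weight_col='wt'):
    # sort by group value, then return the first group whose cumulative
    # count reaches half of the total count
    d = df.sort_values(value_col)
    cum = d[weight_col].cumsum()
    return d.loc[cum >= d[weight_col].sum() / 2.0, value_col].iloc[0]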
You can use chain from itertools. I used a list comprehension to get a list of the aggregation group repeated the appropriate number of times, then used chain to flatten it into a single list. Finally, I converted it to a Series and calculated the median:
from itertools import chain
import pandas as pd

df = pd.DataFrame([6, 9, 16, 17, 23], index=[1, 3, 5, 7, 9], columns=['counts'])
median = pd.Series([i for i in chain(*[[k] * v for k, v in zip(df.index, df.counts)])]).median()
>>> median
7.0