how to split csv file data in batches?

I have a CSV file whose number of lines is a multiple of 16.
After reading it, I want to iterate over the data and inspect each batch of 16 rows.
For example, the following file has 6 lines, a multiple of 2:
1 2 4
4 5 6
4 5 7
3 4 7
6 7 1
3 1 8
Then I want to divide these lines into 3 tables:
1 2 4
4 5 6

4 5 7
3 4 7

6 7 1
3 1 8

and iterate over each of these individual tables.
Thanks a lot

There are a lot of ways you can do this. One way is to use numpy to create the groupings and then use groupby to perform the iteration.
print(df)
   a  b  c
0  1  2  4
1  4  5  6
2  4  5  7
3  3  4  7
4  6  7  1
5  3  1  8
# integer division of the row position assigns every 2 consecutive rows to one group
groups = np.arange(len(df)) // 2
for idx, subset in df.groupby(groups):
    print(subset)
    print("-" * 10)
# prints:
   a  b  c
0  1  2  4
1  4  5  6
----------
   a  b  c
2  4  5  7
3  3  4  7
----------
   a  b  c
4  6  7  1
5  3  1  8
----------
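For reference, here is a self-contained sketch of the same idea that builds the sample DataFrame from the question and collects the batches into a list. The names `batch_size` and `batches` are illustrative only and do not come from the answer above.

```python
import numpy as np
import pandas as pd

# Sample data taken from the question: six rows, split into batches of two.
df = pd.DataFrame(
    [[1, 2, 4], [4, 5, 6], [4, 5, 7], [3, 4, 7], [6, 7, 1], [3, 1, 8]],
    columns=["a", "b", "c"],
)

batch_size = 2  # illustrative name; use 16 for the real file

# Integer division of the row position gives 0, 0, 1, 1, 2, 2, ...
groups = np.arange(len(df)) // batch_size

# Collect each batch as its own DataFrame so it can be inspected later.
batches = [subset for _, subset in df.groupby(groups)]

for batch in batches:
    print(batch.to_string())
    print("-" * 10)
```

If the number of rows is not an exact multiple of `batch_size`, the last batch simply contains the leftover rows.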

If you don't need the entire dataset at once and only need a specific number of rows at a time, consider reading the CSV file in chunks rather than loading it all at once.
Something like this will work:
fileName = 'sample.csv'
batchSize = 16
for df in pd.read_csv(fileName, chunksize=batchSize):
    # process the chunk here; each df holds at most batchSize rows
    pass
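A runnable sketch of chunked reading, assuming a comma-separated file with no header row. An in-memory `io.StringIO` buffer stands in for the real file here, and `csv_text` and `chunk_sizes` are illustrative names.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file (assumption: comma-separated, no header).
csv_text = "1,2,4\n4,5,6\n4,5,7\n3,4,7\n6,7,1\n3,1,8\n"

batch_size = 2  # use 16 for the real file

chunk_sizes = []
for chunk in pd.read_csv(io.StringIO(csv_text), header=None, chunksize=batch_size):
    # Each chunk is an ordinary DataFrame holding at most batch_size rows.
    chunk_sizes.append(len(chunk))
    print(chunk.to_string())
```

With `chunksize` set, `read_csv` returns an iterator instead of a single DataFrame, so only one batch is held in memory at a time.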

Related

Create new column with a group ID that changes based on the value of another column

I have a dataframe with a bunch of Q&A sessions. Each time the speaker changes, the dataframe has a new row. I'm trying to assign question characteristics to the answers so I want to create an ID for each question-answer group. In the example below, I want to increment the id each time a new question is asked (speakertype_id == 3 => questions; speakertype_id == 4 => answers). I currently loop through the dataframe like so:
Q_A = pd.DataFrame({'qna_id':[9]*10,
                    'qnacomponentid':[3,4,5,6,7,8,9,10,11,12],
                    'speakertype_id':[3,4,3,4,4,4,3,4,3,4]})
group = [0]*len(Q_A)
j = 1
for index,row in enumerate(Q_A.itertuples()):
    if row[3] == 3:
        j+=1
    group[index] = j
Q_A['group'] = group
This gives me the desired output and is much faster than I expected, but this post makes me question whether I should ever iterate over a pandas dataframe. Any thoughts on a better method? Thanks.
Edit: Expected output:
qna_id qnacomponentid speakertype_id group
9 3 3 2
9 4 4 2
9 5 3 3
9 6 4 3
9 7 4 3
9 8 4 3
9 9 3 4
9 10 4 4
9 11 3 5
9 12 4 5

You can use eq and cumsum like this:
Q_A['gr2'] = Q_A['speakertype_id'].eq(3).cumsum()
print(Q_A)
qna_id qnacomponentid speakertype_id group gr2
0 9 3 3 2 1
1 9 4 4 2 1
2 9 5 3 3 2
3 9 6 4 3 2
4 9 7 4 3 2
5 9 8 4 3 2
6 9 9 3 4 3
7 9 10 4 4 3
8 9 11 3 5 4
9 9 12 4 5 4
Note: I'm not sure whether you have a reason to start at 2, but you can add 1 after the cumsum if that is a requirement.
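As a self-contained sketch of this approach, the snippet below rebuilds the question's DataFrame and adds 1 after the cumulative sum so that the IDs start at 2, as in the expected output. The intermediate name `is_question` is illustrative.

```python
import pandas as pd

Q_A = pd.DataFrame({
    "qna_id": [9] * 10,
    "qnacomponentid": [3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "speakertype_id": [3, 4, 3, 4, 4, 4, 3, 4, 3, 4],
})

# True on every question row; the running total of those flags is the group ID.
is_question = Q_A["speakertype_id"].eq(3)

# Adding 1 makes the first group ID 2, matching the loop in the question.
Q_A["group"] = is_question.cumsum() + 1
print(Q_A)
```

Because every answer row contributes 0 to the running total, it inherits the ID of the most recent question above it.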

I reproduced your expected output like this:
Q_A['cumsum'] = Q_A[Q_A.speakertype_id!=Q_A.speakertype_id.shift()].groupby('speakertype_id').cumcount()+2
Q_A['cumsum'] = Q_A['cumsum'].ffill().astype('int')

Add together DataFrame rows with the same column values, but preserve ordering

I have a pandas DataFrame that looks like this:
a b c
8 3 3
4 3 3
5 3 3
1 9 4
7 3 1
1 3 3
6 3 3
9 7 7
1 7 7
I want to get a DataFrame like this:
a b c
17 3 3
1 9 4
7 3 1
7 3 3
10 7 7
Essentially, I want to add together the values in column a when the values in columns b and c are the same, but I want to do that in sections. groupby wouldn't work here because it would put the DataFrame out of order. I have an iterative solution, but it is messy and not very Pythonic. Is there a way to do this using the functions of the DataFrame?

Use shift with cumsum to build a subgroup key that increments whenever b or c changes, then group by that key:
s = df[['b','c']].ne(df[['b','c']].shift()).any(axis=1).cumsum()
out = df.groupby([s,df.b,df.c]).agg({'a':'sum','b':'first','c':'first'}).reset_index(drop=True)
a b c
0 17 3 3
1 1 9 4
2 7 3 1
3 7 3 3
4 10 7 7
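Here is a self-contained sketch of the shift-and-cumsum technique on the question's data. It is a variant, not the answer's exact code: it groups on the block ID alone and uses named aggregation, and the names `keys` and `block` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [8, 4, 5, 1, 7, 1, 6, 9, 1],
    "b": [3, 3, 3, 9, 3, 3, 3, 7, 7],
    "c": [3, 3, 3, 4, 1, 3, 3, 7, 7],
})

keys = df[["b", "c"]]

# A new block starts wherever b or c differs from the previous row.
block = keys.ne(keys.shift()).any(axis=1).cumsum()

# Grouping on the block ID alone keeps the blocks in their original order.
out = (
    df.groupby(block)
      .agg(a=("a", "sum"), b=("b", "first"), c=("c", "first"))
      .reset_index(drop=True)
)
print(out)
```

Since the block ID only ever increases down the frame, the sorted group keys follow the original row order, which is what keeps the two separate (3, 3) runs apart.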

Try this:
df.groupby(['b', 'c', df[['b', 'c']].diff().any(axis=1).cumsum()], as_index=False)['a'].sum()
Output (the rows come back sorted by b and c, not in the original order):
b c a
0 3 1 7
1 3 3 17
2 3 3 7
3 7 7 10
4 9 4 1

My dataframe saved as a csv file adds a new column automatically when I read the same file immediately

Here is a simple example.
I have created a dataframe
df = pd.DataFrame(np.random.randint(1,10,(10,3)),columns=['a','b','c'])
saved it down to my folder
df.to_csv('testing.csv')
but as soon as I read the same file
df = pd.read_csv('testing.csv')
it seems to be adding a new column automatically. Does anyone know what's happening here?
Unnamed: 0 a b c
0 0 4 5 6
1 1 1 5 1
2 2 8 6 2
3 3 7 9 7
4 4 3 2 6
5 5 9 1 2
6 6 4 1 3
7 7 3 3 3
8 8 5 3 7
9 9 4 3 8

An extra column is added when loading the CSV because you are saving the file with the index, which here is the DataFrame's default index.
Add index=False when saving:
df.to_csv('testing.csv', index=False)
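The round trip can be sketched as follows, using in-memory buffers instead of a file on disk so the example is self-contained. The sample values and buffer names are illustrative. It also shows an alternative: keep the index in the file and pass `index_col=0` when reading.

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [4, 1, 8], "b": [5, 5, 6], "c": [6, 1, 2]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reread_default = pd.read_csv(with_index)

# Option 1: do not write the index at all.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reread_clean = pd.read_csv(without_index)

# Option 2: keep the index in the file and tell read_csv which column holds it.
with_index.seek(0)
reread_indexed = pd.read_csv(with_index, index_col=0)

print(list(reread_default.columns))
print(list(reread_clean.columns))
print(list(reread_indexed.columns))
```

Only the default round trip produces the `Unnamed: 0` column; both options give back the original three columns.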

Rearrange data file

I am trying to reorganize a .txt file containing a list of data, with traits in the columns and families in the rows. Basically, I need to write a program that creates rows comparing the people in each family, so that the traits of persons 1 and 2, 1 and 3, and 2 and 3 are compared, i.e.:
A 1 2 7 8 9 10
A 1 3 7 9 9 11
etc.
where A is the family, the first 2 numbers are the people compared, the 3rd and 4th numbers are trait1 such as the measurements for each person, and the final numbers are trait2 such as the BMI values for each person.
My input is like this:
A 1 trait trait
A 2 trait trait
A 3 trait trait
I was able to create a data frame using:
data = pandas.read_csv('family.txt', sep=" ", header=None)
print(data)
I cannot seem to figure out an efficient way to concatenate the data into the rows needed above. Any help is greatly appreciated!
Thank you

OK, suppose your data is as follows:
A 1 7 4 5 6
A 2 6 5 4 7
A 3 7 7 5 4
B 1 7 4 5 6
B 2 6 5 4 7
B 3 7 7 5 4
Where the first column is the family and the second column is the person_id and all subsequent columns are traits.
Some super dirty and super hastily written code below seems to give you what you want
file_lines = []
out_list = []
final_out = []
def read_file():
    # read every line of the input file into the global list
    global file_lines
    with open("sample.txt", 'r') as fd:
        file_lines = fd.read().splitlines()
    print(file_lines)
def make_output():
    global file_lines, out_list, final_out
    out_line = []
    for line1 in file_lines:
        for line2 in file_lines:
            line1c = line1.split(" ")
            line2c = line2.split(" ")
            # only compare people from the same family
            if line1c[0] == line2c[0]:
                # skip self-comparisons and pairs already seen in the other order
                # (string comparison, so this assumes single-digit person IDs)
                if line1c[1] >= line2c[1]:
                    continue
                else:
                    out_list = []
                    out_list.append(line1c[0])
                    out_list.append(line1c[1])
                    out_list.append(line2c[1])
                    # interleave the trait values of the two people
                    for i in range(2, len(line1c)):
                        out_list.append(line1c[i])
                        out_list.append(line2c[i])
                    print(" ".join(out_list))
read_file()
make_output()
Before the edit described below, the output of print was:
A 1 2 7 6 4 5 5 4 6 7
A 1 3 7 7 4 7 5 5 6 4
A 2 1 6 7 5 4 4 5 7 6
A 2 3 6 7 5 7 4 5 7 4
A 3 1 7 7 7 4 5 5 4 6
A 3 2 7 6 7 5 5 4 4 7
B 1 2 7 6 4 5 5 4 6 7
B 1 3 7 7 4 7 5 5 6 4
B 2 1 6 7 5 4 4 5 7 6
B 2 3 6 7 5 7 4 5 7 4
B 3 1 7 7 7 4 5 5 4 6
B 3 2 7 6 7 5 5 4 4 7
As you can see, in family A person 1 is compared with 2 and 3, person 2 with 1 and 3, and person 3 with 1 and 2.
Obviously there is duplication, because each pair of people in a family is compared twice.
It's trivial to remove this by maintaining a list of who has been compared with whom.
P.S.: I know the script is really dirty, but I just wanted to illustrate what I've done, not write production code.
EDIT: I wanted to write a slightly more complicated duplicate remover, but since the data is so simple, a small modification to the continue criterion solved it. The output after this edit is:
A 1 2 7 6 4 5 5 4 6 7
A 1 3 7 7 4 7 5 5 6 4
A 2 3 6 7 5 7 4 5 7 4
B 1 2 7 6 4 5 5 4 6 7
B 1 3 7 7 4 7 5 5 6 4
B 2 3 6 7 5 7 4 5 7 4
which is free of duplicates
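As an alternative sketch, not the answer's own method, the same duplicate-free pairing can be written with `itertools.combinations`, which yields each unordered pair exactly once. The sample rows are hard-coded here rather than read from a file, and the names `families` and `pairs` are illustrative.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows from the answer above: family, person ID, then trait values.
lines = [
    "A 1 7 4 5 6",
    "A 2 6 5 4 7",
    "A 3 7 7 5 4",
    "B 1 7 4 5 6",
    "B 2 6 5 4 7",
    "B 3 7 7 5 4",
]

# Group the rows by family, preserving input order.
families = defaultdict(list)
for line in lines:
    family, person, *traits = line.split()
    families[family].append((person, traits))

pairs = []
for family, members in families.items():
    # combinations() yields each unordered pair of family members once.
    for (p1, t1), (p2, t2) in combinations(members, 2):
        # Interleave the two people's values trait by trait.
        interleaved = [value for pair in zip(t1, t2) for value in pair]
        pairs.append(" ".join([family, p1, p2, *interleaved]))

print("\n".join(pairs))
```

This avoids comparing person IDs as strings, so it does not depend on the IDs being single digits.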

TypeError: unhashable type: 'slice' for pandas

I have a pandas datastructure, which I create like this:
test_inputs = pd.read_csv("../input/test.csv", delimiter=',')
Its shape
print(test_inputs.shape)
is this
(28000, 784)
I would like to print a subset of its rows, like this:
print(test_inputs[100:200, :])
print(test_inputs[100:200, :].shape)
However, I am getting:
TypeError: unhashable type: 'slice'
Any idea what could be wrong?

Indexing in pandas can be confusing: it looks like list indexing, but it is not. You need to use .iloc, which indexes by position:
print(test_inputs.iloc[100:200, :])
If you don't need column selection, you can omit it:
print(test_inputs.iloc[100:200])
P.S. Using .loc is not what you want here, as it looks up rows by index label rather than by row number (labels can be anything: they need not be numbers and need not be unique). A range in .loc finds the rows labelled 100 and 200 and returns them together with everything in between. If you have just created the DataFrame, .iloc and .loc may give nearly the same result, but relying on .loc in this case is bad practice: it leads to hard-to-understand problems once the index changes for some reason (for example, after you select a subset of rows, the row number and the index label no longer match).
P.P.S. You can use test_inputs[100:200], but not test_inputs[100:200, :], because the pandas designers tried to combine several popular approaches into one construct. test_inputs['column'] is equivalent to test_inputs.loc[:, 'column'], but, surprisingly, slicing with integers, test_inputs[100:200], is equivalent to test_inputs.iloc[100:200] (while slicing with non-integer values is equivalent to .loc row slicing). If you pass a pair of values to [], pandas treats it as a tuple for multilevel column indexing, so multi_level_columns_df['level_1', 'level_2'] is equivalent to multi_level_columns_df.loc[:, ('level_1', 'level_2')]. That is why your original construction led to the error: a slice can't be used as part of a multilevel column key.
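To make the difference between position and label concrete, here is a small sketch on a made-up ten-row DataFrame (the column name `x` and the variable names are illustrative). After filtering, the index labels no longer match the row positions, so .iloc and .loc select different rows.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# Keep only the even rows; the index labels are now 0, 2, 4, 6, 8.
subset = df[df["x"] % 2 == 0]

# Positional slice: second and third rows, upper bound excluded.
by_position = subset.iloc[1:3]

# Label slice: labels 2 through 6, both ends included.
by_label = subset.loc[2:6]

print(by_position.index.tolist())
print(by_label.index.tolist())
```

The positional slice always returns two rows here, while the label slice returns whichever rows carry labels in the requested range.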

There are more possible solutions, but the output is not the same:
loc selects by labels and includes both endpoints, while iloc and plain slicing select by position, where the start bound is included and the upper bound is excluded (see the docs on selection by position):
test_inputs = pd.DataFrame(np.random.randint(10, size=(28, 7)))
print(test_inputs.loc[10:20])
0 1 2 3 4 5 6
10 3 2 0 6 6 0 0
11 5 0 2 4 1 5 2
12 5 3 5 4 1 3 5
13 9 5 6 6 5 0 1
14 7 0 7 4 2 2 5
15 2 4 3 3 7 2 3
16 8 9 6 0 5 3 4
17 1 1 0 7 2 7 7
18 1 2 2 3 5 8 7
19 5 1 1 0 1 8 9
20 3 6 7 3 9 7 1
print(test_inputs.iloc[10:20])
0 1 2 3 4 5 6
10 3 2 0 6 6 0 0
11 5 0 2 4 1 5 2
12 5 3 5 4 1 3 5
13 9 5 6 6 5 0 1
14 7 0 7 4 2 2 5
15 2 4 3 3 7 2 3
16 8 9 6 0 5 3 4
17 1 1 0 7 2 7 7
18 1 2 2 3 5 8 7
19 5 1 1 0 1 8 9
print(test_inputs[10:20])
0 1 2 3 4 5 6
10 3 2 0 6 6 0 0
11 5 0 2 4 1 5 2
12 5 3 5 4 1 3 5
13 9 5 6 6 5 0 1
14 7 0 7 4 2 2 5
15 2 4 3 3 7 2 3
16 8 9 6 0 5 3 4
17 1 1 0 7 2 7 7
18 1 2 2 3 5 8 7
19 5 1 1 0 1 8 9

print(test_inputs.values[100:200, :])
print(test_inputs.values[100:200, :].shape)
This code also works for me.

I was facing the same problem, and even the solutions above didn't fix it. It seemed to be some problem with pandas. What fixed it for me was converting the DataFrame into a numpy array:
import pandas as pd
import numpy as np
test_inputs = pd.read_csv("../input/test.csv", delimiter=',')
test_inputs = np.asarray(test_inputs)
