Python: select random values for a column from csv

I am having trouble printing random values from a CSV file for a given column name or index (this is my second day in the Python world :) ).
So far I have managed to write the following:
#!/usr/bin/python
import csv # This will help us read csv formatted files.
import random # This provides the shuffle function
load_file = open('<filename>', "rb")
reader = csv.reader(load_file) # The reader will put each line
# of the csv file into a list of columns
for row in reader:
    from random import shuffle
    shuffle(row[2])
    print row[2]
load_file.close()
It prints shuffled (random) values from the third column in the file.
Objectives:
. Define the number of values to output (1000, 2000, 50000, etc.).
. The values are highly skewed; how can I ensure a uniform distribution? For example, if the column contains mostly 0s and a few 1s, I want to see both values in the output for any sample size.
. Write the output to a file (not urgent at this point).
I am using Python 2.6.6.

Here is an unrelated example to show you how the shuffle and pop methods can be used:
from random import shuffle
a = [1,2,3,4,5]
shuffle(a)
print a
[5,1,4,2,3]
print a.pop()
3
print a
[5,1,4,2]
The pop method without any arguments removes the last element from a list and returns it. Since you shuffle the list beforehand, you will get the elements in a random order every time.

From what I understand, you want to do this:
Read a CSV file with an unknown number of rows;
Gather all the items in a given column, say column 2;
Choose at random one row from that column.
If that is correct, it is fairly easy to do.
Suppose we have a CSV file like so:
1,2,3,4
5,6,7,8
9,10,11,12
13,14,15,16
Usually you would process a CSV file row by row. Since you want all the data from one column and the total number of rows is not known in advance, you need to read the entire file before you have a complete set of data to work with.
Here is a way:
import csv
col = 2
fn = '<filename>' # path to your CSV file
with open(fn, 'r') as f:
    reader = csv.reader(f)
    data = [row[col] for row in reader]
print data
# ['3', '7', '11', '15']
Then, if you want a single random item out of that list, use random.choice(data).
If you want to shuffle all the items in that column, use random.shuffle(data), then print it as a column using something like print '\n'.join(data), provided all the elements of data are strings.
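The question also asks for a fixed sample size in which every distinct value shows up even when the column is heavily skewed. Below is a minimal sketch of one way that might be done. The function name sample_with_all_values and its seed parameter are hypothetical, and the sketch uses Python 3 syntax rather than the asker's Python 2.6. It is not a uniform sample: it forces one copy of each distinct value and draws the rest with replacement from the column.

```python
import random

def sample_with_all_values(data, n, seed=None):
    """Return n items from data, with every distinct value present at least once.

    Hypothetical helper: requires n >= number of distinct values in data.
    """
    rng = random.Random(seed)
    distinct = sorted(set(data))
    if n < len(distinct):
        raise ValueError("n is smaller than the number of distinct values")
    # Start with one guaranteed copy of each distinct value...
    picked = list(distinct)
    # ...then fill the remainder by drawing with replacement from the column.
    picked.extend(rng.choice(data) for _ in range(n - len(distinct)))
    rng.shuffle(picked)
    return picked

# Made-up skewed column: mostly '0' with a few '1' values.
column = ['0'] * 97 + ['1'] * 3
result = sample_with_all_values(column, 10, seed=1)
print('\n'.join(result))
```

Because the rest of the sample is drawn from the original column, the output stays about as skewed as the input apart from the forced copies.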

Thanks @dawg, @sshashank124 and others.
Here is the code:
#!/usr/bin/python
import csv # This will help us read csv formatted files.
from random import shuffle # shuffles a list in place
col = 2
with open('<filename>', 'r') as f:
    reader = csv.reader(f)
    data = [row[col] for row in reader]
shuffle(data)
print '\n'.join(data[:100])
It gives me the output in the form of a column.
I am going to try to write it as a function and add other features next. I might start a separate thread for that.

Related

How can I periodically skip rows reading txt with pandas?

I need to process data measured every 20 seconds throughout 2018. The raw file has the following structure:
date time a lot of trash
in several rows
amount of samples trash again
data
date time a lot of trash
etc.
I want to build one pandas DataFrame from it, or at least one DataFrame per block of data (the block size is given by the amount of samples), keeping the time of measurement.
How can I ignore all the other junk? I know that it is written periodically (period = amount of samples), but:
- I don't know how many lines are in the file
- I don't want to call file.getline() explicitly in a loop, because it would take forever (especially in Python) and I don't have enough computing power for that
Is there a way to skip rows periodically in pandas or another library? How else could I solve this?
Here is an example of my data:
https://drive.google.com/file/d/1OefLwpTaytL7L3WFqtnxg0mDXAljc56p/view?usp=sharing
I want to get a DataFrame similar to the data table in the picture, with an additional date-time column and without the technical rows.

Use itertools.islice, where N below means that every Nth line is read:
from itertools import islice
import pandas as pd
N = 3
sep = ','
with open(file_path, 'r') as f:
    lines_gen = islice(f, None, None, N)
    df = pd.DataFrame([x.strip().split(sep) for x in lines_gen])
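If the data line is not the first line of each block, islice also accepts a start offset. The sketch below assumes a made-up layout of three junk lines followed by one data line. The real file layout is not shown in the question, so START and STEP are placeholders to adjust. It uses only the standard library and an in-memory file.

```python
from io import StringIO
from itertools import islice

# Hypothetical layout: three junk lines followed by one data line, repeated.
raw = StringIO(
    "date time junk\n"
    "more junk\n"
    "amount of samples junk\n"
    "1,2,3\n"
    "date time junk\n"
    "more junk\n"
    "amount of samples junk\n"
    "4,5,6\n"
)

START = 3  # zero-based index of the first data line
STEP = 4   # one data line per four-line block

# islice reads the file lazily and yields only lines START, START + STEP, ...
rows = [line.strip().split(",") for line in islice(raw, START, None, STEP)]
print(rows)  # [['1', '2', '3'], ['4', '5', '6']]
```

The resulting list of rows could then be passed to pd.DataFrame, as in the answer above.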

I repeated your data three times. It sounds like you need every 4th row (not starting at 0), because that is where your data lies. The documentation for skiprows says:
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
So what if we pass a "not in" test to the lambda function? That is what I am doing below.
I create a set of the row indexes I want to keep and pass a "not in" test to the skiprows argument. In plain English: skip every row that is not a 4th line.
import pandas as pd
# creating a set of every 4th row index (a set keeps the "not in" test fast).
# If you need more than 1 million, just up the range number
rows_to_keep = set(range(3, 1000000, 4))
# passing this set to the lambda function using not in.
df = pd.read_csv(r'PATH_To_CSV.csv', skiprows=lambda x: x not in rows_to_keep)
df.head()
#output
0 data
1 data
2 data

Just count how many lines are in the file, build a list of the ones that should be skipped (call it useless_rows), and pass it to pandas.read_csv(..., skiprows=useless_rows).
My problem was counting the rows cheaply.
There are a few ways to do it:
On Linux, the "wc -l" command (here are instructions for calling it from your code: Running "wc -l <filename>" within Python Code)
Generators. My relevant rows contain a key in the last column. It is not really informative, but it was a lifesaver for me. I can count the lines using it; it turns out there are about 500000 lines, and it took 0.00011 to count them.
def filtered_rows(filename):
    with open(filename) as f:
        for row in f:
            if '2147483647' in row:
                continue
            yield row
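The count-then-skip idea might look roughly like the sketch below. The helper names count_lines and rows_to_skip are hypothetical, and the layout (one data row in every four, starting at index 3) is an assumption, not something taken from the real file.

```python
from io import StringIO

def count_lines(lines):
    """Count lines lazily, without loading the whole file into memory."""
    return sum(1 for _ in lines)

def rows_to_skip(total, first_data_row=3, period=4):
    """Indices of every row that is not a data row (hypothetical layout)."""
    keep = set(range(first_data_row, total, period))
    return [i for i in range(total) if i not in keep]

# In-memory stand-in for the real file.
fake_file = StringIO("junk\njunk\njunk\ndata\njunk\njunk\njunk\ndata\n")
total = count_lines(fake_file)
useless_rows = rows_to_skip(total)
print(total, useless_rows)  # 8 [0, 1, 2, 4, 5, 6]
```

The resulting useless_rows list is the kind of value that would be passed as skiprows to pandas.read_csv.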

Using Python built-ins only, is it possible to read in only a specified set of columns to add to a Python dictionary?

I have the header name of a column from a series of massive csv files with 50+ fields. Across the files, the index of the column I need is not always the same.
I have written code that finds the index number of the column in each file. Now I'd like to add only this column as the key in a dictionary where the value counts the number of unique strings in this column.
Because these csv files are massive and I'm trying to use best-practices for efficient data engineering, I'm looking for a solution that uses minimal memory. Every solution I find for writing a csv to a dictionary involves writing all of the data in the csv to the dictionary and I don't think this is necessary. It seems that the best solution involves only reading in the data from this one column and adding this column to the dictionary key.
So, let's take this as sample data:
FOODS;CALS
"PIZZA";600
"PIZZA";600
"BURGERS";500
"PIZZA";600
"PASTA";400
"PIZZA";600
"SALAD";100
"CHICKEN WINGS";300
"PIZZA";600
"PIZZA";600
The result I want:
food_dict = {'PIZZA': 6, 'PASTA': 1, 'BURGERS': 1, 'SALAD': 1, 'CHICKEN WINGS': 1}
Now let's say that I want the data from only the FOODS column and in this case, I have set the index value as the variable food_index.
Here's what I have tried, the problem being that the columns are not always in the same index location across the different files, so this solution won't work:
from itertools import islice
food_dict = {}
with open(input_data_txt, "r") as file:
    # This enables skipping the header line.
    skipped = islice(file, 1, None)
    for i, line in enumerate(skipped, 2):
        try:
            food, cals = line.split(";")
        except ValueError:
            pass
        if food not in food_dict:
            food_dict[food] = 1
        else:
            food_dict[food] += 1
This solution works for only this sample -- but only if I know the location of the columns ahead of time -- and again, a reminder that I have upwards of 50 columns and the index position of the column I need is different across files.
Is it possible to do this? Again, built-ins only -- no Pandas or Numpy or other such packages.

The important part here is that you do not skip the header line! You need to split that line and find the indices of the columns you need! Since you know the column headers for the information you need, put those into a reference list:
wanted_headers = ["FOODS", "RECYCLING"]
with open(input_data_txt, "r") as infile:
    # Read only the header line, not the whole file
    header = infile.readline().strip().split(';')
    wanted_cols = [header.index(label) for label in wanted_headers if label in header]
    # wanted_cols is now a list of column numbers you want
    for line in infile: # Iterate through remaining file
        fields = line.strip().split(';')
        data = [fields[col] for col in wanted_cols]
You now have the data in the same order as your existing headers; you can match it up or rearrange as needed.
Does that solve your blocking point? I've left plenty of implementation for you ...
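Putting the pieces together, here is a minimal sketch that looks up the column by its header name and counts its values while streaming the file row by row. The function name count_column is hypothetical, and the in-memory sample is a shortened version of the data in the question. It uses only the standard library.

```python
import csv
from collections import Counter
from io import StringIO

def count_column(lines, column_name, delimiter=";"):
    """Count the values of one named column, reading one row at a time."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    index = header.index(column_name)  # raises ValueError if the column is absent
    # Only the wanted field of each row is kept; short rows are ignored.
    return Counter(row[index] for row in reader if len(row) > index)

sample = StringIO('FOODS;CALS\n"PIZZA";600\n"PIZZA";600\n"BURGERS";500\n"PASTA";400\n')
food_dict = count_column(sample, "FOODS")
print(dict(food_dict))  # {'PIZZA': 2, 'BURGERS': 1, 'PASTA': 1}

# The same call works when the column sits at a different position.
swapped = StringIO('CALS;FOODS\n600;"PIZZA"\n100;"SALAD"\n')
swapped_counts = count_column(swapped, "FOODS")
```

With a real file, pass the open file object instead of the StringIO; the csv module removes the surrounding double quotes from values such as "PIZZA".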

Use Counter and csv:
from collections import Counter
import csv
with open(filename) as f:
    reader = csv.reader(f, delimiter=';') # the sample data is semicolon-separated
    next(reader, None) # skips header
    histogram = Counter(line[0] for line in reader)

Searching for specific text in csv(excel format) file

CSV Sample
So I have a CSV file (sample in the link above), with variable names in row 7 and values in row 8. The variables all have units after them, and the values are just numbers, like this:
Velocity (ft/s) Volumetric (Mgal/d) Mass Flow (klb/d) Sound Speed (ft/s)
.-0l.121 1.232 1.4533434 1.233423
There are a lot more variables, but basically I need some way to search the CSV file for specific unit groups and then collect the values associated with them. For example, search for the text "(ft/s)" and then build a dictionary with Velocity and Sound Speed as keys and their associated values. I am unable to do this because the CSV is formatted like an Excel spreadsheet, and each cell contains the whole variable name together with its unit.
In the end I will have a dictionary for each unit group. I need to do it this way because the unit groups change with each generated CSV file (ft/s becomes m/s). I also can't use an Excel reader, because it doesn't work in IronPython.

You can use the csv module to read the appropriate lines into lists. defaultdict is a good choice for data aggregation, while variable names and units can be easily separated by splitting on '('.
import csv
import collections
with open(csv_file_name) as fp:
    reader = csv.reader(fp)
    for k in range(6): # skip 6 lines
        next(reader)
    varnames = next(reader) # 7th line
    values = next(reader) # 8th line
groups = collections.defaultdict(dict)
for i, (col, value) in enumerate(zip(varnames, values)):
    if i < 2: # skip the first two columns
        continue
    name, units = map(str.strip, col.strip(')').split('(', 1))
    groups[units][name] = float(value)
Edit: added the code to skip first two columns
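To try the grouping step without the original file, here is a self-contained sketch on in-memory data. The helper name group_by_units and the numeric values are made up for illustration, and the sample has no leading lines or extra columns to skip. It uses Python 3 syntax.

```python
import csv
from collections import defaultdict
from io import StringIO

def group_by_units(varnames, values):
    """Map each unit string to a dict of {variable name: value}."""
    groups = defaultdict(dict)
    for col, value in zip(varnames, values):
        if "(" not in col:
            continue  # this header cell carries no units
        name, _, units = col.partition("(")
        units = units.strip().rstrip(")").strip()
        groups[units][name.strip()] = float(value)
    return dict(groups)

# Made-up header and value rows in the shape described in the question.
sample = StringIO(
    "Velocity (ft/s),Volumetric (Mgal/d),Mass Flow (klb/d),Sound Speed (ft/s)\n"
    "0.121,1.232,1.4533434,1.233423\n"
)
reader = csv.reader(sample)
varnames = next(reader)
values = next(reader)
groups = group_by_units(varnames, values)
print(groups["ft/s"])  # {'Velocity': 0.121, 'Sound Speed': 1.233423}
```

Because the units are read from the header cells rather than hard-coded, the same code should keep working when a file switches from ft/s to m/s.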

I'll help with the part I think you're stuck on, which is extracting the units from the category. Given your data, your best bet may be to use a regex. The following should work:
import re
f = open('data.csv')
# I assume the first row has the header you listed in your question
header = f.readline().split(',') # since you said it's a csv
for item in header:
    print re.search(r'\(.+\)', item).group()
    print re.sub(r'\(.+\)', '', item)
That should print the following for you:
(ft/s)
Velocity
(Mgal/d)
Volumetric
(klb/d)
Mass Flow
(ft/s)
Sound Speed
You can modify the above to store these in a list, then iterate through them to find duplicates and merge the appropriate strings to dictionaries or whatnot.

pairwise comparison within a column pandas python Biopython

I have a large data set that I read in with pandas, and I want to do pairwise alignment with pairwise2.
import pandas as pd
from pandas import DataFrame
from Bio import pairwise2 #for pairwise alignments
from Bio.pairwise2 import format_alignment #for printing alignments out neatly
But here I will use a mock data set:
data = { 'sequence': ['ACAAGAGTGGGACTATACAGTGGGTACAGTTATGACTTC', 'GCACGGGCCCTTGGCTAC', 'GCAACAAGGGGGGATACAGCGGGAACAGTGGACAAGTGGTTCGATGTC']}
data = DataFrame(data)
It looks like this:
Out[34]:
sequence
0 ACAAGAGTGGGACTATACAGTGGGTACAGTTATGACTTC
1 GCACGGGCCCTTGGCTAC
2 GCAACAAGGGGGGATACAGCGGGAACAGTGGACAAGTGGTTCGATGTC
My goal is to do a pairwise alignment within the 'sequence' column, so the first row is compared with the second, the second with the third, the third with the first, and so on for a larger data set.
My code:
for seq in data['sequence']:
    for a in pairwise2.align.globalxx(seq, seq):
        print(format_alignment(*a)) # this is just to print the alignment out neatly
This prints out:
ACAAGAGTGGGACTATACAGTGGGTACAGTTATGACTTC
|||||||||||||||||||||||||||||||||||||||
ACAAGAGTGGGACTATACAGTGGGTACAGTTATGACTTC
Score=39
GCACGGGCCCTTGGCTAC
||||||||||||||||||
GCACGGGCCCTTGGCTAC
Score=18
GCAACAAGGGGGGATACAGCGGGAACAGTGGACAAGTGGTTCGATGTC
||||||||||||||||||||||||||||||||||||||||||||||||
GCAACAAGGGGGGATACAGCGGGAACAGTGGACAAGTGGTTCGATGTC
Score=48
This is close to what I want, but it only compares the first to the first, the second to the second, and the third to the third.
So I tried this:
for seq in data['sequence']: # for each 'sequence' column value
    for index, row in data.iterrows(): # for each row
        for a in pairwise2.align.globalxx(seq, row['sequence']): # compare 'sequence' column value to each row of the 'sequence' column
            print(format_alignment(*a))
This produced so many lines of output that I am not even going to try to post it here.
My idea was to compare each 'sequence' value to the rows of the 'sequence' column, but the output contained far more alignments than expected. I think the double loop is not the way to go here. I guess my question doesn't really have anything to do with Biopython; it is simply: how can I do pairwise comparisons within one column?

Use the combinatoric generators from itertools.
import itertools
for seq0, seq1 in itertools.combinations(data['sequence'], 2):
    for a in pairwise2.align.globalxx(seq0, seq1):
        print(format_alignment(*a))
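To see which pairs itertools.combinations produces, here is a small standard-library sketch without Biopython. The three sequences are shortened stand-ins for the mock data. Each unordered pair appears exactly once and no item is paired with itself, so n items give n*(n-1)/2 comparisons instead of the n*n produced by the double loop.

```python
from itertools import combinations

# Shortened stand-ins for the mock sequences in the question.
sequences = ["ACAAGAGTGG", "GCACGGGCCC", "GCAACAAGGG"]

# Pair up row indices so it is easy to see which rows get compared.
pairs = list(combinations(range(len(sequences)), 2))
print(pairs)  # [(0, 1), (0, 2), (1, 2)]

for i, j in pairs:
    print(sequences[i], "vs", sequences[j])
```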

CSV find max in column and append new data

I asked a question about two hours ago regarding reading and writing data from a website. I've spent the two hours since then trying to find a way to read the maximum date value from column 'A' of the output, compare that value to the refreshed website data, and append any new data to the CSV file without overwriting the old data or creating duplicates.
The code that is currently 100% working is this:
import requests
symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
data = requests.get(url)
with open("trades_{}.csv".format(symbol), "r+") as f:
    f.write(data.text)
I've tried various ways of finding the maximum value of column 'A': a bunch of different approaches using dicts and other methods of sorting or finding the max, and even the pandas and numpy libraries. None of them seems to work. Could someone point me toward a decent way to find the maximum of a column in the .csv file? Thanks!

If you have it in a pandas DataFrame, you can get the max of any column like this:
>>> max(data['time'])
'2012-01-18 15:52:26'
where data is the variable name for the DataFrame and time is the name of the column.

I'll give you two answers, one that just returns the max value, and one that returns the row from the CSV that includes the max value.
import csv
import operator as op
import requests
symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
csv_file = "trades_{}.csv".format(symbol)
data = requests.get(url)
with open(csv_file, "w") as f:
    f.write(data.text)
with open(csv_file) as f:
    next(f) # discard first row from file -- see notes
    max_value = max(row[0] for row in csv.reader(f))
with open(csv_file) as f:
    next(f) # discard first row from file -- see notes
    max_row = max(csv.reader(f), key=op.itemgetter(0))
Notes:
max() can directly consume an iterator, and csv.reader() gives us an iterator, so we can just pass that in. I'm assuming you might need to throw away a header line so I showed how to do that. If you had multiple header lines to discard, you might want to use islice() from the itertools module.
In the first one, we use a "generator expression" to select a single value from each row, and find the max. This is very similar to a "list comprehension" but it doesn't build a whole list, it just lets us iterate over the resulting values. Then max() consumes the iterable and we get the max value.
max() can use a key= argument where you specify a "key function". It will use the key function to get a value and use that value to figure the max... but the value returned by max() will be the unmodified original value (in this case, a row value from the CSV). In this case, the key function is manufactured for you by operator.itemgetter()... you pass in which column you want, and operator.itemgetter() builds a function for you that gets that column.
The resulting function is the equivalent of:
def get_col_0(row):
    return row[0]
max_row = max(csv.reader(f), key=get_col_0)
Or, people will use lambda for this:
max_row = max(csv.reader(f), key=lambda row: row[0])
But I think operator.itemgetter() is convenient and nice to read. And it's fast.
I showed saving the data in a file, then pulling from the file again. If you want to go through the data without saving it anywhere, you just need to iterate over it by lines.
Perhaps something like:
text = data.text
rows = [line.split(',') for line in text.split("\n") if line]
rows.pop(0) # get rid of first row from data
max_value = max(row[0] for row in rows)
max_row = max(rows, key=op.itemgetter(0))
I don't know which column you want... column "A" might be column 0 so I used 0 in the above. Replace the column number as you like.
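The question also asks how to append only the new rows. Here is a minimal sketch of that step, assuming the first column holds an integer timestamp and that neither source has a header row. Both are assumptions about the data, and the function name rows_newer_than_existing is hypothetical. It works on in-memory text, so no download is needed.

```python
import csv
from io import StringIO

def rows_newer_than_existing(existing_lines, fresh_lines, col=0):
    """Return rows from fresh_lines whose timestamp exceeds the max already stored."""
    existing = [row for row in csv.reader(existing_lines) if row]
    # default=None covers the case where nothing has been stored yet
    latest = max((int(row[col]) for row in existing), default=None)
    return [
        row for row in csv.reader(fresh_lines)
        if row and (latest is None or int(row[col]) > latest)
    ]

# Made-up rows: timestamp, price, amount.
existing = StringIO("1000,5.0,1.0\n1010,5.1,2.0\n")
fresh = StringIO("1010,5.1,2.0\n1020,5.2,0.5\n")
new_rows = rows_newer_than_existing(existing, fresh)
print(new_rows)  # [['1020', '5.2', '0.5']]
```

The returned rows could then be appended by opening the CSV file in "a" mode and calling csv.writer(f).writerows(new_rows). One caveat of the strict comparison is that a new row sharing the latest stored timestamp would be dropped.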

It seems like something like this should work:
import requests
import csv
symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
data = requests.get(url)
with open("trades_{}.csv".format(symbol), "r+") as f:
    all_values = list(csv.reader(f))
    max_value = max([int(row[2]) for row in all_values[1:]])
    # (write out the value?)
EDITS: I used "row[2]" because that was the sample column I was taking max of in my csv. Also, I had to strip off the column headers, which were all text, which was why I looked at "all_values[1:]" from the second row to the end of the file.
