Structuring NumPy Arrays - python

A university task has us constructing a program that reads CSV files and prepares them for analysis. In a previous question, we constructed a function to open and read certain columns of an input CSV file. Here is the code written for that question:
import csv
import numpy

def load_metrics(filename):
    """Loads data from CSV files."""
    data = []
    with open(filename) as csv_file:
        csv_reader = csv.reader(csv_file)
        for row in csv_reader:
            # keep columns 0-1 and 7-13 of each row
            data.append(row[0:2] + row[7:14])
    return numpy.array(data)
However, this next question has me stumped, especially with mentions of unicode. Any idea how I should tackle this question? I believe I should start by removing the header row, but am unsure where to go from there. Thank you.
The NumPy array you created in task 1 is unstructured because we let NumPy decide what the datatype for each value should be. It also contains the header row, which is not needed for the analysis. Typically the data contains float values, with some descriptive columns like created_at. So we are going to remove the header row, and we are going to tell NumPy explicitly to convert all columns to type float (i.e., "float"), apart from the columns specified by indexes, which should be Unicode strings of length 30 characters (i.e., "<U30"). Finally, every row is converted to a tuple (e.g., tuple(i) for i in data).
Write a function unstructured_to_structured(data, indexes) that achieves the above goal.
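The thread does not show an accepted answer, so here is a minimal sketch that follows the specification above. It assumes the header row returned by load_metrics supplies the field names for the structured dtype; the exact naming scheme is an assumption, since the assignment text only fixes the types.
import numpy as np

def unstructured_to_structured(data, indexes):
    """Converts the unstructured array from task 1 into a structured one."""
    header = data[0]   # the first row holds the column names
    rows = data[1:]    # drop the header row from the data
    # Build one (name, type) pair per column: Unicode of length 30 for
    # the columns listed in indexes, float for everything else
    dtype = [(name, "<U30" if i in indexes else "float")
             for i, name in enumerate(header)]
    # Structured arrays expect each record as a tuple
    return np.array([tuple(row) for row in rows], dtype=dtype)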

Related

How do I save a date input from csv as a variable PYTHON

I need my program to take one cell in a CSV that is a date and save it as a variable, then read the next cell in the column and save it as the next variable. The rest of the program does a calculation with the dates, but I am not sure how to save each date as a variable. (The part that is commented out is what it would look like if a user did the input themselves, but I would like it read in from the CSV.)
You can convert csv to numpy with:
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
And then you can just iterate through the created array to get the elements. Be careful with how the delimiter is defined: here it's a comma, but sometimes it's something else entirely, so make sure you know how your CSV data are delimited.
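If you need actual date objects rather than floats (genfromtxt would turn a date string into nan), here is a sketch using the csv and datetime modules, assuming a hypothetical file dates.csv with the date in the first column in YYYY-MM-DD format:
import csv
from datetime import datetime

dates = []
with open('dates.csv') as f:   # hypothetical file name
    reader = csv.reader(f)
    for row in reader:
        # parse the first cell of each row into a date object
        dates.append(datetime.strptime(row[0], '%Y-%m-%d').date())

first_date, second_date = dates[0], dates[1]  # each date as its own variable
print((second_date - first_date).days)        # date objects support arithmetic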

Replace a row in a pandas dataframe with values from dictionary

I am trying to populate an empty dataframe by using the csv module to iterate over a large tab-delimited file, and replacing each row in the dataframe with these values. (Before you ask, yes I have tried all the normal read_csv methods, and nothing has worked because of dtype issues and how large the file is).
I first made an empty numpy array using np.empty, using the dimensions of my data. I then converted this to a pandas DataFrame. Then, I did the following:
with open(input_file) as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t')
    row_num = 0
    for row in reader:
        for key, value in row.items():
            df.loc[row_num, key] = value
        row_num += 1
This is working great, except that my file has 900,000 columns, so it is unbelievably slow. This also feels like something that pandas could do more efficiently, but I've been unable to find how. The dictionary for each row given by DictReader looks like:
{'columnName1':<value>,'columnName2':<value> ...}
Where the values are what I want to put in the dataframe in those columns for that row.
Thanks!
What you could do in this case is read your big CSV file in smaller chunks. I had the same issue with a 32 GB CSV file, so I had to build chunks. After reading them in, you can work with them.
# read the large csv file with specified chunksize
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)
chunksize=1000000 sets how many rows are read in at once.
Helpful website:
https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c
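To make "work with them" concrete, here is a sketch that processes each chunk as it is read and reassembles the parts you keep; the file path is the one from above and the filter column is a hypothetical placeholder:
import pandas as pd

pieces = []
# stream the file in chunks of one million rows
for chunk in pd.read_csv(r'../input/data.csv', chunksize=1000000):
    # process each chunk while it is in memory, e.g. keep a subset of rows
    pieces.append(chunk[chunk['value'] > 0])   # 'value' is a hypothetical column

df = pd.concat(pieces, ignore_index=True)      # recombine the processed parts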

how to write comma separated list items to csv in a single column in python

I have a list (fulllist) of 292 items and converted it to a data frame, then tried writing it to CSV in Python.
import pandas as pd
my_df = pd.DataFrame(fulllist)
my_df.to_csv('Desktop/pgm/111.csv', index=False,sep=',')
But some of the comma-separated values spill across several columns of the CSV. I am trying to keep those values in a single column.
A portion of the output is shown below.
I have tried writerows, but it won't work.
import csv
with open('Desktop/pgm/111.csv', "wb") as f:
    writer = csv.writer(fulllist)
    writer.writerows(fulllist)
I also tried "".join whenever the length of a list item is greater than 1, but that did not give the result either. How do I make a proper CSV so that each field fills a single column?
My expected output CSV has all of the values in a single column.
Please keep in mind that .csv files are in fact plain text files, and how given software understands .csv depends on its implementation; for example, some programs allow a newline character as part of a field when it sits between " and ", while others treat every newline character as the start of the next row.
Do you have to use .csv format? If not consider other possibilities:
DSV (https://en.wikipedia.org/wiki/Delimiter-separated_values) is similar to CSV, but you can use, for example, ; instead of , which should help if you do not have ; in your data.
openpyxl allows writing and reading of .xlsx files.
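If it does have to stay .csv, here is a sketch with the csv module that puts every item of fulllist into a single column; items that are themselves lists are joined first, and the writer quotes the result automatically because it contains commas:
import csv

with open('111.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for item in fulllist:
        # collapse list items into one string so the row has a single field
        cell = ','.join(item) if isinstance(item, list) else str(item)
        writer.writerow([cell])   # a one-element row fills exactly one column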

Extracting columns containing a certain name

I'm trying to use Python to manipulate data in large txt files.
I have a txt-file with more than 2000 columns, and about a third of these have a title which contains the word 'Net'. I want to extract only these columns and write them to a new txt file. Any suggestion on how I can do that?
I have searched around a bit but haven't been able to find something that helps me. Apologies if similar questions have been asked and solved before.
EDIT 1: Thank you all! At the moment of writing 3 users have suggested solutions and they all work really well. I honestly didn't think people would answer so I didn't check for a day or two, and was happily surprised by this. I'm very impressed.
EDIT 2: I've added a picture that shows what a part of the original txt-file can look like, in case it will help anyone in the future:
One way of doing this, without the installation of third-party modules like numpy/pandas, is as follows. Given an input file, called "input.csv" like this:
a,b,c_net,d,e_net
0,0,1,0,1
0,0,1,0,1
The following code does what you want.
import csv

input_filename = 'input.csv'
output_filename = 'output.csv'

# Instantiate a CSV reader, check if you have the appropriate delimiter
reader = csv.reader(open(input_filename), delimiter=',')

# Get the first row (assuming this row contains the header)
input_header = next(reader)

# Filter out the columns that you want to keep by storing the column index
columns_to_keep = []
for i, name in enumerate(input_header):
    if 'net' in name:
        columns_to_keep.append(i)

# Create a CSV writer to store the columns you want to keep
writer = csv.writer(open(output_filename, 'w', newline=''), delimiter=',')

# Construct the header of the output file
output_header = []
for column_index in columns_to_keep:
    output_header.append(input_header[column_index])

# Write the header to the output file
writer.writerow(output_header)

# Iterate over the remainder of the input file, construct a row
# with the columns you want to keep and write this row to the output file
for row in reader:
    new_row = []
    for column_index in columns_to_keep:
        new_row.append(row[column_index])
    writer.writerow(new_row)
Note that there is no error handling. There are at least two cases that should be handled. The first is checking for the existence of the input file (hint: check the functionality provided by the os and os.path modules). The second is handling blank lines or lines with an inconsistent number of columns.
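For the first check, a minimal sketch (reusing input_filename from the code above):
import os.path

# guard against a missing input file before constructing the reader
if not os.path.isfile(input_filename):
    raise FileNotFoundError('Input file does not exist: ' + input_filename)
Blank or short lines can then be skipped inside the row loop with a guard such as if len(row) != len(input_header): continue.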
This could be done, for instance, with pandas:
import pandas as pd
df = pd.read_csv('path_to_file.txt', sep=r'\s+')
print(df.columns) # check that the columns are parsed correctly
selected_columns = [col for col in df.columns if "net" in col]
df_filtered = df[selected_columns]
df_filtered.to_csv('new_file.txt')
Of course, since we don't have the structure of your text file, you will have to adapt the arguments of read_csv to make this work in your case (see the corresponding documentation).
This loads the whole file into memory and then filters out the unnecessary columns. If your file is so large that it cannot be loaded into RAM at once, you can load only specific columns with the usecols argument, as sketched below.
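A sketch of that two-pass usecols approach (same placeholder path as above): read only the header first to discover the column names, then load just the matching columns.
import pandas as pd

# first pass: nrows=0 reads only the header row
header = pd.read_csv('path_to_file.txt', sep=r'\s+', nrows=0)
net_columns = [col for col in header.columns if 'net' in col]

# second pass: load only the matching columns into memory
df_filtered = pd.read_csv('path_to_file.txt', sep=r'\s+', usecols=net_columns)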
You can use the pandas filter method to select a few columns based on a regex:
data_filtered = data.filter(regex='net')

how to write a string to csv?

Please help me write data horizontally to a CSV file.
The following code writes each character on a separate row:
import csv
with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows('jgjh')
but I need the whole string in a single cell.
The csv module deals in sequences; use writer.writerow() (singular) and give it a list of one column:
writer.writerow(['jgjh'])
The .writerow() method will take each element in the sequence you give it and make it one column. Best to give it a list of columns, and ['jgjh'] makes one such column.
.writerows() (plural) expects a sequence of such rows; for your example you'd have to wrap the one row into another list to make it a series of rows, so writer.writerows([['jgjh']]) would also achieve what you want.
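Putting both forms together in a runnable snippet:
import csv

with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['jgjh'])      # one row with one column: jgjh
    writer.writerows([['jgjh']])   # the plural form needs a list of rows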
