Python: read a CSV with four columns - python

Data from Datadog
I am looking for some assistance reading this data from Datadog. I am reading it from the downloaded CSV and want to load it in Python so that I can build an application that reads it at regular intervals.
I have tried reading the data like below:
import pandas as pd
fileload = pd.read_csv("DataSource/extract-2023-02-02T19_10_32.790Z.csv")
print(fileload)
fileload1 = pd.read_csv("DataSource/extract-2023-02-02T19_11_05.899Z.csv")
final = pd.concat([fileload, fileload1])
print(final)
import csv
with open("DataSource/extract-2023-02-02T19_10_32.790Z.csv", 'r' ) as file:
    csvread = csv.reader(file)
    for i in file:
        print(i)
    a = pd.DataFrame([csvread])
    print(type(a))
My expectation is that I can pick the last column, with all the data in the above format, and give it column names, and then analyse the data by applying some aggregations on top.
Please assist.

Have you tried:
final[["final_column_name"]]
final['New_col_name'] = ...
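A minimal sketch of that suggestion, assuming the export has four columns and no header row. The sample rows and the column names `timestamp`, `host`, `service` and `value` are made up for illustration; they are not taken from the Datadog export.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two downloaded extracts (no header row assumed).
csv_a = io.StringIO(
    "2023-02-02T19:10:32Z,web-1,api,12.5\n"
    "2023-02-02T19:10:33Z,web-2,api,7.5\n"
)
csv_b = io.StringIO("2023-02-02T19:11:05Z,web-1,api,10.0\n")

# header=None stops pandas from treating the first data row as column names;
# names= assigns our own (assumed) column names.
columns = ["timestamp", "host", "service", "value"]
frames = [pd.read_csv(f, header=None, names=columns) for f in (csv_a, csv_b)]
final = pd.concat(frames, ignore_index=True)

# Pick the last column by position, whatever it is called.
last_col = final.iloc[:, -1]

# Example aggregation: mean of the last column per host.
mean_by_host = final.groupby("host")["value"].mean()
print(mean_by_host)
```

If the real file does have a header row, dropping `header=None` and `names=` and renaming afterwards with `final.rename(columns={...})` would be the equivalent route.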

Related

Pandas read_csv incorrectly naming columns

I am trying to import a Leukemia gene expression data set found at https://www.kaggle.com/brunogrisci/leukemia-gene-expression-cumida. This data set has a lot of columns (22285), and the columns imported towards the end have incorrect names. For example, the last column, named AFFX-r2-P1-cre-3_at, is actually called 217005_at in the CSV file. The image below shows my Jupyter notebook cells. I am not sure why it is being formatted this way. Any help would be greatly appreciated.
Evidently the CSV file has column names that start with 'AFFX-r2-P1' -- it's not a pandas issue. Using the built-in csv package shows:
import csv
from pathlib import Path
data_file = Path('../../../Downloads/Leukemia_GSE9476.csv')
with open(data_file, 'rt') as lines:
    csv_file = csv.reader(lines)
    fields = next(csv_file)
#
[
    (field_number, field)
    for field_number, field in enumerate(fields)
    if field.startswith('AFFX-r2-P1')
]
The output is:
[(22277, 'AFFX-r2-P1-cre-3_at'), (22278, 'AFFX-r2-P1-cre-5_at')]
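The same check can be made from pandas without loading any data rows. This is only a sketch: the in-memory sample below is a hypothetical miniature of the file, and its first two column names are invented.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: the AFFX probe is the last header field.
sample = io.StringIO("samples,type,217005_at,AFFX-r2-P1-cre-3_at\n1,AML,3.1,9.8\n")

# nrows=0 parses only the header line, so no data rows are loaded.
header = pd.read_csv(sample, nrows=0).columns

# Position of each header name as pandas sees it.
positions = {name: header.get_loc(name) for name in header}
print(positions["217005_at"], positions["AFFX-r2-P1-cre-3_at"])
```

Comparing these positions with what a text editor shows for the header line should tell you whether the names differ in the file itself or only in how the notebook displays them.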

Grab values from a separate csv file and replace the values of columns in a pipe delimited file

Trying to whip this out in Python. Long story short, I have a CSV file that contains column data I need to inject into another file that is pipe-delimited. My understanding is that Python can't replace values in place, so I have to rewrite the whole file with the new values.
data file(csv):
value1,value2,iwantthisvalue3
source file(txt, | delimited)
value1|value2|iwanttoreplacethisvalue3|value4|value5|etc
fixed file(txt, | delimited)
samevalue1|samevalue2| replacedvalue3|value4|value5|etc
I can't figure out how to accomplish this. This is my latest attempt(broken code):
import re
import csv
result = []
row = []
with open("C:\data\generatedfixed.csv","r") as data_file:
    for line in data_file:
        fields = line.split(',')
        result.append(fields[2])
with open("C:\data\data.txt","r") as source_file, with open("C:\data\data_fixed.txt", "w") as fixed_file:
    for line in source_file:
        fields = line.split('|')
        n=0
        for value in result:
            fields[2] = result[n]
            n=n+1
            row.append(line)
    for value in row
        fixed_file.write(row)
I would highly suggest you use the pandas package here. It makes handling tabular data very easy and would help you a lot in this case. Once you have installed pandas, import it with:
import pandas as pd
To read the files simply use:
data_file = pd.read_csv(r"C:\data\generatedfixed.csv")
source_file = pd.read_csv(r"C:\data\data.txt", delimiter="|")
After that, manipulating these two files is easy. I'm not exactly sure how many values or which ones you want to replace, but if the columns "iwantthisvalue3" and "iwanttoreplacethisvalue3" have the same number of rows, then this should do the trick:
source_file['iwanttoreplacethisvalue3'] = data_file['iwantthisvalue3']
Now all you need to do is save the dataframe (the table that we just updated) into a file. Since you want to save it to a .txt file with "|" as the delimiter, this is the line to do that (you can customize how it is saved in a lot of ways):
source_file.to_csv(r"C:\data\data_fixed.txt", sep='|', index=False)
Let me know if everything works and this helped you. I would also encourage you to read up (or watch some videos) on pandas if you're planning to work with tabular data; it is an awesome library with great documentation and functionality.
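For comparison, here is a sketch of the same replacement with only the standard library csv module. It assumes neither file has a header row and that row N of the data file belongs to row N of the source file. The function name `replace_third_column` and the sample contents are made up for illustration.

```python
import csv
import io

def replace_third_column(data_file, source_file, fixed_file):
    """Copy source_file to fixed_file, taking the third column from data_file row by row."""
    data_rows = csv.reader(data_file)                      # comma-delimited
    source_rows = csv.reader(source_file, delimiter="|")   # pipe-delimited
    writer = csv.writer(fixed_file, delimiter="|", lineterminator="\n")
    # zip pairs the first rows, then the second rows, and so on;
    # it stops at the end of the shorter file.
    for data_row, source_row in zip(data_rows, source_rows):
        source_row[2] = data_row[2]
        writer.writerow(source_row)

# In-memory stand-ins for the real files (hypothetical contents).
data = io.StringIO("a1,a2,NEW1\nb1,b2,NEW2\n")
source = io.StringIO("a1|a2|old1|a4|a5\nb1|b2|old2|b4|b5\n")
fixed = io.StringIO()
replace_third_column(data, source, fixed)
print(fixed.getvalue())
```

With real files, the three arguments would be file objects opened with `newline=""`, which is what the csv module documentation recommends.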

Extract a particular value from a csv file without loading the whole file

I have several tables in CSV format, and I am using Python and the csv module. I want to extract a particular value, let's say column=80, row=109.
Here is a random example:
import csv
with open('hugetable.csv', 'r') as file:
    reader = list(csv.reader(file))  # a csv.reader cannot be indexed, so the rows go into a list
print(reader[109][80])
I am doing this many times with large tables and I would like to avoid loading the whole table into an array (line 2 above) to ask for a single value. Is there a way to open the file, load the specific value and close it again? Would this process be more efficient than what I have done above?
Thanks for all the answers, all answers so far work pretty well.
You could try reading the file without the csv library:
row = 108
column = 80
with open('hugetable.csv', 'r') as file:
    header = next(file)
    for _ in range(row-1):
        _ = next(file)
    line = next(file)
print(line.strip().split(',')[column])
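Splitting on ',' gives wrong columns when a field contains a quoted comma. A sketch that keeps the early-stop behaviour but lets csv.reader do the parsing is shown below. The helper name `read_cell` is hypothetical, and it counts rows from 0 including the header, which differs from the numbering used above.

```python
import csv
import io
from itertools import islice

def read_cell(file, row, column):
    """Return the field at 0-based (row, column), reading no further than that row."""
    reader = csv.reader(file)
    # islice skips the first `row` records without storing them.
    record = next(islice(reader, row, None), None)
    if record is None:
        raise IndexError(f"file has fewer than {row + 1} rows")
    return record[column]

# Hypothetical sample with commas inside quoted fields.
sample = io.StringIO('id,name,note\n1,"Smith, J",ok\n2,"Lee, K",late\n')
print(read_cell(sample, 2, 1))
```

The rows before the target still have to be read and parsed, since CSV has no index, but nothing beyond the target row is touched and no table is kept in memory.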
You can try pandas to load only certain columns of your CSV file:
import pandas as pd
pd.read_csv('foo.csv',usecols=["column1", "column2"])
You could use pandas to load only the rows you need:
import pandas as pd
text = pd.read_csv('Book1.csv', sep=',', header=None, skiprows= 100, nrows=3)
print(text[50])

Extracting metadata from csv without loading data in python

I am trying to get the dimensions (shape) of a data frame using pandas in python without reading the entire data frame first in memory given that the file is quite large.
To get the number of columns with minimal loading of the file into memory, I can for example use the nrows argument as below.
import pandas as pd
df = pd.read_csv("myData.csv", nrows=1)
print(df.shape)
To get the number of rows I can use the argument usecols=[1] when reading the file, but there must be a simpler way of doing this.
If there are other packages or scripts that can easily give me such metadata, I would be happy with those as well. It is really metadata I am looking for, such as column names, number of rows and number of columns, but I don't want to read the entire file in!
You don't even need pandas for this. Use the built-in csv module to parse the file:
import csv
with open('myData.csv') as fp:
    reader = csv.reader(fp)
    headers = next(reader) # The header row is now consumed
    ncol = len(headers)
    nrow = sum(1 for _ in reader) # What remains are the data rows
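If staying within pandas is preferred, a rough equivalent is to read the header alone and then stream a single column in chunks. This is a sketch on a made-up in-memory sample; the chunk size of 2 is only there to make the chunking visible, and a real file would use something much larger.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9\n")

# Column names only: nrows=0 parses just the header line.
columns = list(pd.read_csv(sample, nrows=0).columns)

# Row count: stream one column in chunks so only one chunk is in memory at a time.
sample.seek(0)
nrow = sum(len(chunk) for chunk in pd.read_csv(sample, usecols=[0], chunksize=2))
print(columns, nrow)
```

Every line is still read from disk either way; the saving is in memory, not in I/O.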

Python: How to create a new dataframe starting at the first row where a specific value is found

I am reading csv files into python using:
df = pd.read_csv(r"C:\csvfile.csv")
But the file has some summary data first, and the raw data starts where the value "valx" is found. If "valx" is not found then the file is useless. I would like to create new dataframes that start where "valx" is found. I have been trying for a while with no success. Any help on how to achieve this is greatly appreciated.
Unfortunately, pandas' skiprows only skips rows by position (a count from the top, a list of row indices, or a function of the row index), not by content. You might want to parse the file before creating the dataframe.
As an example:
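import csv
with open(r"C:\csvfile.csv", "r", newline='') as f:
    lines = list(csv.reader(f))
# Index of the first row that contains 'valx', or None if it never appears
start = next((n for n, row in enumerate(lines) if 'valx' in row), None)
data = lines[start:] if start is not None else None
Using the standard library csv module, you can read the file and check whether valx is in it. If it is found, the rows from that point onward are stored in the data variable; otherwise data is None.
From there you can use the data variable to create your dataframe.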
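One way to go from the parsed rows to a dataframe, as a sketch. It assumes the row containing "valx" is the header of the raw table, which the question does not confirm, and the sample contents are made up.

```python
import csv
import io
import pandas as pd

# Hypothetical file: two summary lines, then the raw table whose header contains "valx".
raw = io.StringIO("report,2023\ntotal,42\nvalx,valy\n1,2\n3,4\n")

rows = list(csv.reader(raw))
# Index of the first row containing 'valx', or None when it is absent.
start = next((n for n, row in enumerate(rows) if "valx" in row), None)

if start is None:
    df = None  # no 'valx' row, so the file is useless, as the question puts it
else:
    # Treat the 'valx' row as the header and everything after it as data.
    df = pd.DataFrame(rows[start + 1:], columns=rows[start])
print(df)
```

All values come out as strings here; something like `pd.to_numeric` or `df.astype(...)` would be needed before doing arithmetic on them.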
