Pandas read_csv incorrectly naming columns - python

I am trying to import a leukemia gene expression data set found at https://www.kaggle.com/brunogrisci/leukemia-gene-expression-cumida. This data set has many columns (22,285), and the columns imported towards the end have incorrect names. For example, the last column, named AFFX-r2-P1-cre-3_at, is actually called 217005_at in the CSV file. The image below shows my Jupyter notebook cells. I am not sure why it is being read this way. Any help would be greatly appreciated.

Evidently the CSV file has column names that start with 'AFFX-r2-P1' -- it's not a pandas issue. Using the built-in csv package shows:
import csv
from pathlib import Path

data_file = Path('../../../Downloads/Leukemia_GSE9476.csv')
with open(data_file, 'rt') as lines:
    csv_file = csv.reader(lines)
    fields = next(csv_file)

# Columns whose names start with the prefix in question
[
    (field_number, field)
    for field_number, field in enumerate(fields)
    if field.startswith('AFFX-r2-P1')
]
The output is:
[(22277, 'AFFX-r2-P1-cre-3_at'), (22278, 'AFFX-r2-P1-cre-5_at')]
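If you want to confirm what pandas itself sees without loading all 22,285 columns, reading zero data rows parses just the header. A small inline sample stands in for the real file here, so the column names below are illustrative:

```python
import io
import pandas as pd

# Stand-in for the real file: a header row plus one data row
sample = io.StringIO(
    "samples,type,1007_s_at,AFFX-r2-P1-cre-3_at,AFFX-r2-P1-cre-5_at\n"
    "1,AML,3.2,5.1,4.8\n"
)

# nrows=0 parses only the header, so even a very wide file is cheap to inspect
header = pd.read_csv(sample, nrows=0)
print(list(header.columns[-2:]))  # → ['AFFX-r2-P1-cre-3_at', 'AFFX-r2-P1-cre-5_at']
```

The same `nrows=0` call against the downloaded file will show whether the trailing names come from the CSV itself.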

Data from datadog
I am looking for some assistance reading this data from Datadog; I am reading it from the downloaded CSV. I want to read it in Python so that I can create an application that reads the same data at regular intervals.
I have tried reading the data like below:
import pandas as pd

fileload = pd.read_csv("DataSource/extract-2023-02-02T19_10_32.790Z.csv")
print(fileload)
fileload1 = pd.read_csv("DataSource/extract-2023-02-02T19_11_05.899Z.csv")
final = pd.concat([fileload, fileload1])
print(final)
import csv

with open("DataSource/extract-2023-02-02T19_10_32.790Z.csv", 'r') as file:
    csvread = csv.reader(file)
    for i in file:
        print(i)
a = pd.DataFrame([csvread])
print(type(a))
My expectation is that I can pick the last column with all the data in the above format, give it proper column names, and then analyse the data by applying some aggregations on top.
Please assist.
Have you tried:
final[["final_column_name"]]
final['New_col_name'] = ...
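As a sketch of that idea (the column names below are invented, since the Datadog extract's headers aren't shown in the question): pick the last column positionally, store it under a new name, and aggregate.

```python
import pandas as pd

# Hypothetical frame standing in for the concatenated Datadog extracts
final = pd.DataFrame({
    "timestamp": ["2023-02-02T19:10", "2023-02-02T19:11"],
    "value": [10.0, 14.0],
})

# Pick the last column regardless of its name, then store it under a new name
final["renamed_metric"] = final.iloc[:, -1]

# A simple aggregation on top
print(final["renamed_metric"].mean())  # → 12.0
```

`iloc[:, -1]` selects by position, so it works even when the extract's last column has an awkward auto-generated header.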

CSV forgetting some columns when added to ArcGIS Pro

In the first photo I have all the columns I want (the example I will be talking about is Project Scope) in an Excel CSV file. When I go to add the CSV into ArcGIS Pro (second photo), it drops the Project Scope column for some reason; it should be between the Project Phase column and the Construction Finish column. The CSV was generated using a Python script that used pandas, so that may be a lead on the case.
Has anyone encountered this before? Advice would be appreciated!
I tried creating a Python script with pandas that limits the import to the column range A:O, which includes Project Scope, but this came to no avail and my issue remained. Here is the code I have:
import arcpy
import pandas as pd
import csv

# project plan to csv
df = pd.read_excel(r'original.xlsx', usecols="A:O")

# convert to csv
df.to_csv(r'new.csv')
df = pd.read_csv(r'new.csv')

# generate key field
df["unique_id"] = df["City"] + "<>" + df["Project Status"] + '-' + df["Project Phase"]
df.to_csv(r'new.csv')

# isolate unique id column between new and old columns
# (these files are being read, so they must be opened in 'r' mode;
# 'w' mode truncates the file before it can be read)
with open(r'new.csv', 'r', newline='') as infile:
    reader = csv.reader(infile)
    newlist = [row[15] for row in reader]
with open(r'old.csv', 'r', newline='') as infile:
    reader = csv.reader(infile)
    oldlist = [row[15] for row in reader]
I tried deleting some unnecessary columns in the CSV before adding it into ArcGIS Pro; that made the Project Scope field show up, but dropped other columns that are of importance.
Project Scope has line breaks within it. I bet that is causing the problems. Look at the actual CSV file you are trying to load in a text editor.
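If embedded line breaks are indeed the culprit, one workaround is to flatten them before writing the CSV. A minimal sketch, assuming the column names from the question:

```python
import pandas as pd

# Hypothetical row with a line break inside the Project Scope field
df = pd.DataFrame({
    "Project Phase": ["Design"],
    "Project Scope": ["Widen road\nadd sidewalks"],
    "Construction Finish": ["2024"],
})

# Replace embedded newlines with spaces so every record stays on one line
df["Project Scope"] = df["Project Scope"].str.replace(r"\r?\n", " ", regex=True)
print(df.loc[0, "Project Scope"])  # → Widen road add sidewalks
```

After this, each record occupies exactly one physical line, which tools that mishandle quoted newlines tend to cope with better.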

Import pipe delimited txt file into spark dataframe in databricks

I have a data file saved as .txt format which has a header row at the top, and is pipe delimited. I am working in databricks, and am needing to create a spark dataframe of this data, with all columns read in as StringType(), the headers defined by the first row, and the columns separated based on the pipe delimiter.
When importing .csv files I am able to set the delimiter and header options. However, I am not able to get the .txt files to import in the same way.
Example Data (completely made up)... for ease, please imagine it is just called datafile.txt:
URN|Name|Supported
12233345757777701|Tori|Yes
32313185648456414|Dave|No
46852554443544854|Steph|No
I would really appreciate a hand in getting this imported into a Spark dataframe so that I can crack on with other parts of the analysis. Thank you!
Any delimiter-separated file is a good candidate for CSV reading methods; the 'c' of CSV is mostly by convention. Thus nothing stops us from reading this:
col1|col2|col3
0|1|2
1|3|8
Like this (in pure python):
import csv
from pathlib import Path

with Path("pipefile.txt").open() as f:
    reader = csv.DictReader(f, delimiter="|")
    data = list(reader)

print(data)
Since whatever custom reader your libraries are using probably uses csv.reader under the hood, you simply need to figure out how to pass the right separator to it.
@blackbishop notes in a comment that
spark.read.csv("datafile.txt", header=True, sep="|")
would be the appropriate spark call.
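A self-contained check of the pure-Python route, recreating the made-up datafile.txt from the question. Note that csv.DictReader leaves every field as a string, which matches the all-StringType requirement:

```python
import csv
from pathlib import Path

# Recreate the sample file from the question
Path("datafile.txt").write_text(
    "URN|Name|Supported\n"
    "12233345757777701|Tori|Yes\n"
    "32313185648456414|Dave|No\n"
    "46852554443544854|Steph|No\n"
)

with Path("datafile.txt").open(newline="") as f:
    rows = list(csv.DictReader(f, delimiter="|"))

print(rows[0]["Name"])  # → Tori
# Every field comes back as a str, including the numeric-looking URNs
print(all(isinstance(v, str) for row in rows for v in row.values()))  # → True
```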

Multiple txt files as separate rows in a csv file without breaking into lines (in pandas dataframe)

I have many txt files (which have been converted from pdf) in a folder. I want to create a csv/excel dataset where each text file will become a row. Right now I am opening the files in pandas dataframe and then trying to save it to a csv file. When I print the dataframe, I get one row per txt file. However, when saving to csv file, the texts get broken and create multiple rows/lines for each txt file rather than just one row. Do you know how I can solve this problem? Any help would be highly appreciated. Thank you.
Following is the code I am using now.
import glob
import os
import pandas as pd

file_list = glob.glob(os.path.join(os.getcwd(), "K:\\text_all", "*.txt"))
corpus = []
for file_path in file_list:
    with open(file_path, encoding="latin-1") as f_input:
        corpus.append(f_input.read())

df = pd.DataFrame({'col': corpus})
print(df)
df.to_csv('K:\\out.csv')
Update
If this solution is not possible, it would also be helpful to transform the data a bit in the pandas dataframe. I want to create a column with the names of the txt files, that is, the name of each txt file in the folder will become the identifier of the respective text. I will then save it to TSV format so that the lines do not get separated because of commas, as suggested by someone here.
I need something like following.
identifier col
txt1 example text in this file
txt2 second example text in this file
...
txtn final example text in this file
Use
import csv
df.to_csv('K:\\out.csv', quoting=csv.QUOTE_ALL)
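A sketch of the round trip, including the identifier column from the update (file names and texts are invented): QUOTE_ALL wraps every field in quotes, so an embedded newline stays inside a single record when the file is read back.

```python
import csv
import io
import pandas as pd

# Two documents, one containing a line break, as if read from txt files
df = pd.DataFrame({
    "identifier": ["txt1", "txt2"],
    "col": ["example text in this file", "second example\ntext in this file"],
})

# QUOTE_ALL quotes every field, keeping embedded newlines inside one record
buffer = io.StringIO()
df.to_csv(buffer, index=False, quoting=csv.QUOTE_ALL)

# Reading it back recovers exactly one row per original text file
back = pd.read_csv(io.StringIO(buffer.getvalue()))
print(len(back))  # → 2
```

The same `quoting=csv.QUOTE_ALL` argument works when writing to 'K:\\out.csv' directly; it is the reader on the other end that must honour the quoting, which pandas and most spreadsheet tools do.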

Python: How to create a new dataframe starting from the row where a specific value is found

I am reading csv files into python using:
df = pd.read_csv(r"C:\csvfile.csv")
But the file has some summary data, and the raw data start only after a value "valx" is found. If "valx" is not found, the file is useless. I would like to create new dataframes that start from the row where "valx" is found. I have been trying for a while with no success. Any help on how to achieve this is greatly appreciated.
Unfortunately, pandas only accepts skiprows for rows to skip in the beginning. You might want to parse the file before creating the dataframe.
As an example:
import csv

with open(r"C:\csvfile.csv", "r", newline='') as f:
    lines = csv.reader(f)
    if any('valx' in row for row in lines):
        # any() stops at the matching row, so the reader now holds
        # only the rows after it; materialise them before the file closes
        data = list(lines)
Using the standard library csv module, you can read the file and check whether valx appears; if it is found, the rows after it end up in the data variable.
From there you can use the data variable to create your dataframe.
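An alternative sketch in pandas itself: find the line number where "valx" first appears, then let skiprows drop the summary lines above it. A small inline file stands in for C:\csvfile.csv:

```python
import io
import pandas as pd

raw = (
    "summary line 1\n"
    "summary line 2\n"
    "valx,col_a,col_b\n"
    "1,2,3\n"
    "4,5,6\n"
)

# Find the first line containing the marker value
lines = raw.splitlines()
start = next(i for i, line in enumerate(lines) if "valx" in line)

# Skip everything above the marker row and use it as the header
df = pd.read_csv(io.StringIO(raw), skiprows=start)
print(df.shape)  # → (2, 3)
```

With a real file, replace the StringIO with the path and do the marker scan in a first pass over the file; if the marker is never found, `next()` raising StopIteration is your "file is useless" signal.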
