DataFrame is saving brackets while exporting to csv - python

I have a Pandas DataFrame that looks like this.
[DataFrame picture]
I thought of saving a tuple of two values under a column and then retrieving whichever value is needed. But now, for example, if I want the first value of the tuple located at the first row of the 'Ref' column, I get "(" instead of "c0_4":
df = pd.read_csv(df_path)
print(df['Ref'][0][0])
The output for this is "(" and not "c0_4".
I don't want to use split() because I want the values to be searchable in the dataframe. For example, I would want to search for "c0_8" under the "Ref" column and get the row.
What other alternatives do I have to save two values in a row under the same column?

The immediate problem is that you're simply accessing character 0 of a string.
A file is character-oriented storage; there is no "data frame" abstraction at that level. Hence we use CSV to hold the columnar data as text, a format that allows easy writing and later recovery.
A CSV file consists only of text, with the separator character and newline having special meanings. There is no "tuple" form. Your data frame is stored as string data. If you want to recover your original tuple form, you will need to write parsing code to convert the strings back to tuples. Alternatively, you can switch to a pickle (.pkl) format for storing your data.
That should be enough leads to allow you to research whatever alternatives you want.
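For instance (a minimal sketch, not from the original answer, assuming the tuples were written out by to_csv so each cell looks like "('c0_4', 'c0_8')"), ast.literal_eval can rebuild the tuples while reading:
import ast
import pandas as pd

# Parse the stored text "('c0_4', 'c0_8')" back into a Python tuple per cell.
df = pd.read_csv(df_path, converters={"Ref": ast.literal_eval})

print(df["Ref"][0][0])  # now prints 'c0_4' instead of '('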

Your data is stored as a string.
To turn it back into a tuple, split every string in your DataFrame and save it back as a tuple, with something like:
for col in df.columns:
    for i in df.index:
        # strip the surrounding parentheses, then split on the comma
        parts = df.at[i, col].strip("()").split(",")
        df.at[i, col] = (parts[0].strip(" '\""), parts[1].strip(" '\""))
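Once the column actually holds tuples, the lookup mentioned in the question becomes straightforward (a short sketch, assuming the tuples contain plain strings):
# Rows whose 'Ref' tuple contains the value "c0_8".
matches = df[df["Ref"].apply(lambda t: "c0_8" in t)]
print(matches)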

Related

I have a CSV generated from CICFlowMeter and I'm unable to generate a correlation matrix; it either generates an empty data frame or fails with the error below.

This is the code I'm using. I have also tried converting the datatype of my columns from object to float, but I got this error:
df = pd.read_csv('DDOSping.csv')
pearsoncorr = df.corr(method='pearson')
ValueError: could not convert string to float:
'172.27.224.251-172.27.224.250-56003-502-6'
Somewhere in your CSV this string value exists '172.27.224.251-172.27.224.250-56003-502-6'. Do you know why it's there? What does it represent? It looks to me like it shouldn't be in the data you include in your correlation matrix calculation.
The df.corr method is trying to convert the string value to a float, but it's obviously not possible to do because it's a big complicated string with various characters, not a regular number.
You should clean your CSV of unnecessary data (or make a copy and clean that so you don't lose anything important). Remove anything, like metadata, that isn't the exact data that df.corr needs, including the string in the error message.
If it's just a few values you need to clean, open the file in Excel or a text editor and do the cleaning there. If it's a lot, and all the irrelevant data is in specific rows and/or columns, you could instead remove them from your DataFrame before calling df.corr, rather than cleaning the file itself.
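One quick way to do that in pandas (a sketch, not part of the original answer) is to keep only the numeric columns before computing the correlation:
import pandas as pd

df = pd.read_csv('DDOSping.csv')

# Keep only columns pandas parsed as numeric; string columns such as the
# flow ID '172.27.224.251-172.27.224.250-56003-502-6' are excluded.
numeric_df = df.select_dtypes(include='number')

pearsoncorr = numeric_df.corr(method='pearson')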

How to search for a series of integers in a text column in Python and Postgres

I have a comma separated string in Python which I would like to compare to a text column in Postgres. The column looks like this:
data
{"ids":[id1]}
{"ids":[id2,id3,id4]}
{"ids":[id5,id6,id7]}
The comma-separated string in Python looks like this:
id1,id2,id3..
All the ids are integer values. I'd like to take all the ids in the comma-separated string, match them against the data column in Postgres, and return the matching rows. The Postgres table is very large, so I cannot pull it into memory and do the string comparison in Python. The data column is text.
I could easily match a single id using the LIKE operator, but I don't know how to match a whole list of comma-separated values.
Expected resultset in the sample case is:
data
{"ids":[id1]}
{"ids":[id2,id3,id4]}
How do I go about it?
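The thread doesn't include an answer, but one possible approach (a sketch only, assuming the data column holds valid JSON, a hypothetical table name my_table, and psycopg2 as the driver) is to unpack the JSON array on the Postgres side and compare it against the Python list with ANY:
import psycopg2

id_string = "1,2,3"  # the comma-separated string built in Python
ids = [int(x) for x in id_string.split(",")]

query = """
    SELECT data
    FROM my_table                          -- hypothetical table name
    WHERE EXISTS (
        SELECT 1
        FROM jsonb_array_elements_text(data::jsonb -> 'ids') AS elem
        WHERE elem::int = ANY(%(ids)s)
    );
"""

with psycopg2.connect("dbname=mydb") as conn:  # hypothetical connection string
    with conn.cursor() as cur:
        cur.execute(query, {"ids": ids})
        matching_rows = cur.fetchall()
This keeps the filtering inside Postgres, so the large table never has to be transferred to the client.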

Pandas df.loc with regex

I'm working with a data set consisting of several csv files of nearly the same form. Each csv describes a particular date, and labels the data by state/province. However, the format of one of the column headers in the data set was altered from Province/State to Province_State, so that all csv's created before a certain date use the first format and all csv's created after that date use the second format.
I'm trying to sum up all the entries corresponding to a particular state. At present, the code I'm working with is as follows:
daily_data.loc[daily_data[areaLabel] == location].sum()
where daily_data is the dataframe containing the csv data, location is the name of the state I'm looking for, and areaLabel is a variable storing either 'Province/State' or 'Province_State' depending on the result of a date check. I would like to eliminate the date check, e.g. by matching the column against a regular expression like Province(/|_)State, but I'm having a lot of trouble finding a way to index into a pandas dataframe by regular expression. Is this doable (and in a way that would make the code more elegant rather than less elegant)? If so, I'd appreciate it if someone could point me in the right direction.
Use filter to get the columns that match your regex
>>> df.filter(regex="Province(/|_)State").columns[0]
'Province/State'
Then use this to select only rows that match your location:
df[df[df.filter(regex="Province(/|_)State").columns[0]]==location].sum()
This however assumes that there are no other columns that would match the regex.
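An alternative (not part of the original answer, just a sketch) is to normalize the header once when each CSV is loaded, so the rest of the code can rely on a single column name:
# Rename the older header form to the newer one, if present.
daily_data = daily_data.rename(columns={"Province/State": "Province_State"})

state_total = daily_data.loc[daily_data["Province_State"] == location].sum()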

Pyspark, find all the data types of a column

I am doing some data cleaning and data profiling work. So the given data could be quite messy.
I want to get all the potential data types of a column using pyspark.
Data types like:
Integer
Real number
Date/Time
String (Text)
etc
I will need to do more processing to generate the respective metadata based on what types the column contains.
A column can contain more than one type. I don't mean the built-in data types. The given data are all of type string, but some are in the form of "1234" which is actually an int, and some are in the form of "2019/11/19", which is actually a date.
For example, the column number could contain values like
"123"
"123.456"
"123,456.789"
"NUMBER 123"
In the above example, the data types would be INTEGER, REAL NUMBER, STRING.
If I use df.schema[col].dataType, it simply gives me StringType.
I was thinking that I can iterate through the entire column, and use regex to see which type does the row belong to, but I am curious if there is some better way to do it, since it's a relatively large dataset.
For now I have sort of solved the issue by iterating through the column and doing some type checking:
df = spark.sql('SELECT col as _col FROM table GROUP BY _col')
df.rdd.map(lambda s: typeChecker(s))
where typeChecker just checks which type s._col belongs to.
Thanks.
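The thread doesn't show a full answer, but one way to avoid a row-by-row Python checker (a sketch, assuming a hypothetical column name value and purely illustrative regexes) is to let Spark classify the distinct values with rlike:
from pyspark.sql import functions as F

int_re  = r"^-?\d+$"
real_re = r"^-?\d{1,3}(,\d{3})*\.\d+$|^-?\d+\.\d+$"
date_re = r"^\d{4}[/-]\d{2}[/-]\d{2}$"

typed = (
    df.select("value")            # hypothetical column name
      .distinct()
      .withColumn(
          "inferred_type",
          F.when(F.col("value").rlike(int_re), "INTEGER")
           .when(F.col("value").rlike(date_re), "DATE/TIME")
           .when(F.col("value").rlike(real_re), "REAL NUMBER")
           .otherwise("STRING"),
      )
)

typed.groupBy("inferred_type").count().show()
Running the classification on distinct values keeps the work proportional to the number of unique strings rather than the number of rows.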

Access dict columns of a csv file in python pandas

I have a dataset in a CSV file in which one of the columns holds a list (or dict, which further includes several semicolons and commas because of the key/value pairs). The trouble is that accessing it with Pandas returns mixed-up values, because the list contains several commas even though it is in fact a single column.
I have seen several solutions, such as using "" or ; as the delimiter, but the problem is that I already have the data, and a find-and-replace would completely change my dataset.
An example row of the CSV is:
data_column1, data_column2, [{key1:value1},{key2:value2}], data_column3
Please advise on any faster way to access specific columns of the data without any ambiguity.
You can only set the delimiter to one character, so you can't use square brackets in this way. You would need to use a single quoting character such as " so that the parser knows to ignore the commas between the delimiters.
You can try converting the column using the melt function. Here is the link to the documentation:
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.melt.html
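If the bracketed field can be written with quotes around it (a sketch under that assumption, with a hypothetical file name and column name, and with the keys/values written as quoted literals), pandas then reads it as a single column and ast.literal_eval can turn the text back into Python objects:
import ast
import pandas as pd

# Assumes each row quotes the bracketed field, e.g.:
# data_column1,data_column2,"[{'key1': 'value1'}, {'key2': 'value2'}]",data_column3
df = pd.read_csv("data.csv")  # hypothetical file name

# Convert the text representation back into a list of dicts.
df["dict_column"] = df["dict_column"].apply(ast.literal_eval)  # hypothetical column name

print(df["dict_column"][0][0]["key1"])  # -> 'value1'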
