I have a dataset in a CSV file in which one of the columns contains a list (or dict, which further includes several semicolons and commas because of the key, value pairs). The trouble is that accessing it with Pandas returns mixed-up values, because the column that is in fact a single list contains several commas.
I have seen several solutions, such as using "" or ; as the delimiter, but the problem is that I already have the data; find and replace would completely change my dataset.
An example of the CSV is:
data_column1, data_column2, [{key1:value1},{key2:value2}], data_column3
Please advise any faster way to access specific columns of the data without any ambiguity.
You can only set the delimiter to one character, so you can't use square brackets this way. You would need to use a single character such as " so that the parser knows to ignore the commas between the delimiters.
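For example, if the list-like field were wrapped in double quotes in the file, pandas would keep it as a single column despite the inner commas. A minimal sketch with made-up data:

import io
import pandas as pd

# hypothetical sample where the bracketed field is quoted
csv_text = 'data_column1,data_column2,"[{key1:value1},{key2:value2}]",data_column3\n'

df = pd.read_csv(io.StringIO(csv_text), header=None)
print(df.iloc[0, 2])  # [{key1:value1},{key2:value2}]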
You can try converting the column using the melt function. Here is the link to the documentation:
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.melt.html
I have a Pandas DataFrame that looks like this.
DataFrame picture
I thought of saving a tuple of two values under a column and then retrieving whichever value is needed. But now, for example, if I want the first value in the tuple located at the first row of the 'Ref' column, I get "(" instead of "c0_4".
df = pd.read_csv(df_path)
print(df['Ref'][0][0])
The output for this is "(" and not "c0_4".
I don't want to use split() because I want the values to be searchable in the dataframe. For example, I would want to search for "c0_8" under the "Ref" column and get the row.
What other alternatives do I have to save two values in a row under the same column?
The immediate problem is that you're simply accessing character 0 of a string.
A file is character-oriented storage; there is no "data frame" abstraction. Hence, we use CSV to hold the columnar data as text, a format that allows easy output and input recovery.
A CSV file consists only of text, with the separator character and newline having special meanings. There is no "tuple" form. Your data frame is stored as string data. If you want to recover your original tuple form, you will need to write parsing code to convert the strings back to tuples. Alternately, you can switch to the "pickle" format (typically a .pkl file) for storing your data.
That should be enough leads to allow you to research whatever alternatives you want.
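For example, a minimal sketch of the pickle route (the file name is just illustrative). Because pickle serializes Python objects rather than text, the tuples survive the round trip:

import pandas as pd

# hypothetical frame with a tuple-valued column, as in the question
df = pd.DataFrame({"Ref": [("c0_4", "c0_8")]})

df.to_pickle("df.pkl")           # store the frame with its Python objects intact
df2 = pd.read_pickle("df.pkl")   # load it back

print(df2["Ref"][0][0])          # c0_4, not "("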
Your data is stored as a string
To format it into a tuple, split every string in your DataFrame and save it back as a tuple, with something like:
for col in df.columns:
    for idx in df.index:
        s = df.at[idx, col]
        # "(c0_4, c0_8)" -> ("c0_4", "c0_8"): drop the parentheses and split on the comma
        df.at[idx, col] = (s.split(",")[0][1:], s.split(",")[1][:-1])
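As a quick usage check (assuming the 'Ref' column from the question): after the loop, df['Ref'][0][0] returns 'c0_4' instead of '(', and a row can still be looked up by value with something like df[df['Ref'].apply(lambda t: 'c0_8' in t)].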
I notice a lot of Pandas questions on Stack Overflow only include a few rows of their data as text, without the accompanying code to generate/reproduce it. I am aware of the existence of read_clipboard, but I am unable to figure out how to effectively call this function to read data in many situations, such as when there are white spaces in the header names, or Python objects such as lists in the columns.
How can I use pd.read_clipboard more effectively to read data pasted in unconventional formats that don't lend themselves to easy reading using the default arguments? Are there situations where read_clipboard comes up short?
read_clipboard: Beginner's Guide
read_clipboard is truly a saving grace for anyone starting out to answer questions in the Pandas tag. Unfortunately, pandas veterans also know that the data provided in questions isn't always easy to grok into a terminal due to various complications in the format of the data posted.
Thankfully, read_clipboard has arguments that make handling most of these cases possible (and easy). Here are some common use cases and their corresponding arguments.
Common Use Cases
read_clipboard uses read_csv under the hood with white space separator, so a lot of the techniques for parsing data from CSV apply here, such as
parsing columns with spaces in the data
use sep with a regex argument. First, ensure there are at least two spaces between columns and at most one consecutive white space inside a column's data itself. Then you can use sep=r'\s{2,}', which means "split columns on runs of at least two consecutive white spaces" (note: engine='python' is required for multi-character or regex separators):
df = pd.read_clipboard(..., sep=r'\s{2,}', engine='python')
Also see How do you handle column names having spaces in them when using pd.read_clipboard?.
reading a series instead of DataFrame
use squeeze=True; you would likely also need header=None if the first row is also data.
s = pd.read_clipboard(..., header=None, squeeze=True)
Also see Could there be an easier way to use pandas read_clipboard to read a Series?.
loading data with custom header names
use names=[...] in conjunction with header=None and skiprows=[0] to ignore existing headers.
df = pd.read_clipboard(..., header=None, names=['a', 'b', 'c'], skiprows=[0])
loading data without any headers
use header=None
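for example:
df = pd.read_clipboard(header=None)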
set one or more columns as the index
use index_col=[...] with the appropriate label or index
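for example, to use the first column as the index:
df = pd.read_clipboard(index_col=0)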
parsing dates
use parse_dates with the appropriate column(s). If parsing datetimes (i.e., columns where the date and the time are separated by white space), you will likely also need to use sep=r'\s{2,}' while ensuring your columns are separated by at least two spaces.
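for example, assuming the dates live in a column named 'date' (the name is just illustrative):
df = pd.read_clipboard(sep=r'\s{2,}', engine='python', parse_dates=['date'])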
See this answer by me for a more comprehensive list on read_csv arguments for other cases not covered here.
Caveats
read_clipboard is a Swiss Army knife. However, it
cannot read data in prettytable/tabulate formats (IOW, borders make it harder)
See Reading in a pretty-printed/formatted dataframe using pd.read_clipboard? for solutions to tackle this.
cannot correctly parse MultiIndexes unless all elements in the index are specified.
See Copying MultiIndex dataframes with pd.read_clipboard? for solutions to tackle this.
cannot ignore/handle ellipses in data
my suggested method is to manually remove the ellipses from the data first, or to have the frame printed without truncation in the first place (see the sketch after this list)
cannot parse columns of lists (or other objects) as anything other than string. The columns will need to be converted separately, as shown in How do you read in a dataframe with lists using pd.read_clipboard?.
cannot read text from images (so please don't use images as a means to share your data with folks!)
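For the ellipsis caveat, a minimal sketch of how the person sharing the data can print the full frame so pandas never inserts "..." markers in the first place (the sample frame is made up):

import numpy as np
import pandas as pd

# a frame long enough that a plain print(df) would be truncated with "..."
df = pd.DataFrame(np.arange(300).reshape(100, 3), columns=['a', 'b', 'c'])

# temporarily lift the display limits so the frame prints in full
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)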
The one weakness of this function is that it doesn't capture contents of Ctrl + C if the copy is performed from a PDF file. Testing it this way results in an empty read.
But by using a regular text editor, it goes just fine. Here is an example using randomly typed text:
>>> pd.read_clipboard()
Empty DataFrame
Columns: [sfsesfsdsxcvfsdf]
Index: []
I'm working on a small code generation app that loads in an Excel file (using pandas ExcelFile + xlrd) which is then parsed to a dataframe (ExcelFile.parse) for several SQL-like operations. The stored data is then returned to a file writer as a list using map and lambda functions with a little f-string formatting on the specific fields.
The problem I'm having is that not all fields in the Excel file are predictably populated, so I'm using fillna('') during the parsing to a dataframe. But when I come to the f-string, the unpopulated fields cause an error when I apply :.0f formatting to remove the decimals. If I don't use fillna(''), the floats format correctly, but I then have multiple entries of nan as a string value that I can't work out how to convert to ''.
As an example, the code below will fail with fillna('') applied, as NumField3 and NumField4 can be empty in the source spreadsheet.
return list(
map(
lambda row: f"EXEC ***_****_*.****_Register_File("
f"{row['NumField1']:.0f},{row['NumField2']:.0f},"
f"'{row['TextField1']}','{row['TextField2']}',"
f"'{row['TextField3']}','{row['TextField4']}',"
f"{row['NumField3']:.0f},{row['NumField4']:.0f});\n",
df.to_dict("records")))
My original approach was using .format() and itertuples(), but this was apparently a less efficient way. I've opted for the conversion to dictionary so I can retain the field names in the list construction for easier supportability.
I'm probably missing something really simple, but I can't see the wood for the trees at the moment. Any suggestions?
I think I've worked it out. I've removed the fillna('') from the parsing of the ExcelFile object to a dataframe, so unpopulated fields keep their NaN value. When the dataframe records are eventually processed through the map/lambda approach, the original NaN is formatted as the string 'nan', so I've included a re.sub (which needs import re) to look for that value as a whole word and replace it with the required empty string.
It's not pretty, but it works.
return list(
re.sub(r'\bnan\b', '', i) for i in map(
lambda row: f"EXEC ***_****_*.****_Register_File("
f"{row['NumField1']:.0f},{row['NumField2']:.0f},"
f"'{row['TextField1']}','{row['TextField2']}',"
f"'{row['TextField3']}','{row['TextField4']}',"
f"{row['NumField3']:.0f},{row['NumField4']:.0f});\n",
df.to_dict("records")))
I want to concatenate two columns in pandas containing mostly string values and some missing values. The result should be a new column which again contain string values and missings. Mostly it just worked fine with this:
df['newcolumn']=df['column1']+df['column2']
Most of the values in column1 are numbers (interpreted as strings) like 82. But some of the values in column2 are a composition of letters and numbers starting with an E, like E52 or E83. When 82 and E83 are concatenated, the result I want is 82E83. Unfortunately, the result is then 8,2E+84. I guess Python implicitly interpreted this as a number in scientific notation.
I already tried different ways of concatenating and forcing string format, but the result is always the same:
df['newcolumn']=(df['column1']+df['column2']).astype(str)
or
df['newcolumn']=(df['column1'].str.cat(df['column2'])).astype(str)
It seems Python first creates a float, producing this unwanted format, and then changes the type to string, keeping results like 8,2E+84. Is there a solution for strictly keeping the string format?
Edit: Thanks for your comments. When I tried to reproduce the problem myself with a very short dataframe, the problem didn't occur either. Finally I realized that it was only a problem with Excel automatically interpreting the cells as (wrong) numbers in the CSV output. I didn't notice it before because another dataframe, coming from a CSV file that I used for merging with this dataframe on the concatenated strings, had already been "destroyed" the same way by Excel. So the merging didn't work properly and I thought the concatenation in Python was the problem. I used to view the dataframe with Excel because it is really big. In the future I will be more careful with this. My apologies for misplacing the problem!
Type conversion is not required in this case. You can simply use
df["newcolumn"] = df.apply(lambda x: f"{str(x[0])}{str(x[1])}", axis = 1)
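If the positional x[0]/x[1] lookups feel fragile, the same idea works with the column labels from the question (assuming the columns really are named column1 and column2):
df["newcolumn"] = df.apply(lambda x: f"{str(x['column1'])}{str(x['column2'])}", axis=1)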
I am fetching a column from a Dataframe. The column is of string type.
x = "[{somevalues, id:1, name:'xyz'}, {address:Some Value}, {somevalue}]" & so on..
The data is stored as a string. It can be easily represented as a list.
I want the output to be:
LIST of [
{somevalues, id:1, name:'xyz'},
{address:Some Value},
{somevalue}
]
How can I achieve this using Spark's API? I know that with Python I can use the eval(x) function and it will return the list or I can use the x.split() function, which will also return a list. However, in this approach, it needs to iterate for each record.
Also, I want to use mapPartition; that is the reason why I need my string column to be in a list so that I can pass it to mapPartition.
Is there an efficient way where I can also convert my string data using spark API or would mapPartitions be even better as I will be looping every partition rather than every record?
You can use regexp_replace to remove the square brackets and then split on the comma. At first, I thought you'd need to do something special to avoid splitting on the commas within the curly brackets, but it seems Spark SQL automatically avoids that. For example, the following query in Zeppelin
%sql
select split(regexp_replace("[{somevalues, id:1, name:'xyz'}, {address:Some Value}, {somevalue}]", "[\\[\\] ]", ""), ",")
gives me
WrappedArray({somevalues, id:1, name:'xyz'}, {address:SomeValue}, {somevalue})
which is what you want.
You can use withColumn to add a column in this way if you're working with dataframes. And for some reason, if the comma within the curly brackets is being split on, you can do more regex-foo as in this post - Regex: match only outside parenthesis (so that the text isn't split within parenthesis)?.
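For example, a minimal PySpark sketch of the withColumn route; to stay on the safe side of the comma issue it borrows the sentinel trick (replace "},{"  with a marker, then split on the marker) instead of splitting on the bare comma. The column names are just illustrative:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical frame holding the string from the question in a column named "value"
df = spark.createDataFrame(
    [("[{somevalues, id:1, name:'xyz'}, {address:Some Value}, {somevalue}]",)],
    ["value"],
)

# strip the outer brackets, mark the boundaries between the {...} groups, then split on the marker
cleaned = F.regexp_replace(F.regexp_replace("value", r"[\[\]]", ""), r"\},\s*\{", "};&;{")
df2 = df.withColumn("parsed", F.split(cleaned, ";&;"))
df2.show(truncate=False)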
Hope that makes sense. I'm not sure if you're using dataframes, but they're recommended over the lower level RDD api.
If you don't want to go to dataframes, then you can use regex replace and split functions on the RDD data you created.
If you have data as
x = "[{somevalues, id:1, name:'xyz'}, {address:Some Value}, {somevalue}]"
Then you can create an RDD and use regex replace and split functions as
import re
rdd = sc.parallelize([x]).flatMap(lambda x: re.sub(r"},\{", "};&;{", re.sub(r"[\[\]\s]", "", x)).split(";&;"))
flatMap is used so that the split data comes out as separate rows:
{somevalues,id:1,name:'xyz'}
{address:SomeValue}
{somevalue}
I hope the answer is helpful
Note: if you want the solution the dataframe way, you can get ideas from my other answer.