This is a question from someone who regularly answers questions about pandas dataframes. Consider a question where the dataset is given only as its printed representation (not as runnable code), for example:
numbers letters dates all
0 1 a 20-10-2020 NaN
1 2 b 21-10-2020 b
2 3 c 20-11-2020 4
3 4 d 20-10-2021 20-10-2020
4 5 e 10-10-2020 3.14
Is it possible to quickly import this into Python as a dataframe or as a dictionary? So far I have copied the given text and transformed it into a dataframe by hand, turning the values into strings (adding '') and so on.
I think there are two 'solutions' for this:
Make a function that, given the text as input, somehow transforms it into a dataframe.
Use some function in the editor (I use Spyder) which can do this trick for us.
read_clipboard
You can use pd.read_clipboard(), optionally with a separator (e.g. pd.read_clipboard(r'\s\s+') if you have datetime strings or spaces in column names and the columns are separated by at least two spaces):
select text on the question and copy to clipboard (ctrl+c/command-c)
move to python shell or notebook and run pd.read_clipboard()
Note that this doesn't work well on all platforms.
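For example, a minimal session might look like this (a sketch, assuming the table above has been copied to the clipboard and a clipboard backend such as pyperclip or xclip is available):
import pandas as pd

# split on runs of two or more spaces, so values that contain a
# single space stay intact
df = pd.read_clipboard(r'\s\s+')
print(df)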
read_csv + io.StringIO
For more complex formats, combine read_csv with io.StringIO:
import io
import pandas as pd

data = '''
numbers letters dates all
0 1 a 20-10-2020 NaN
1 2 b 21-10-2020 b
2 3 c 20-11-2020 4
3 4 d 20-10-2021 20-10-2020
4 5 e 10-10-2020 3.14
'''

# the header row has one field fewer than the data rows, so read_csv
# uses the first column as the index automatically
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df
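If you also want the dates column parsed as datetimes rather than strings, read_csv can do that in the same call (dayfirst=True because the sample uses day-month-year):
df = pd.read_csv(io.StringIO(data), sep=r'\s+', parse_dates=['dates'], dayfirst=True)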
Related
I have about 500 dataframes which I need to format in a specific fashion. I've already written the code to accomplish the formatting, but I need help applying this code to each of the dataframes before I export to Excel using openpyxl. I've attempted to use a for loop to iterate through a variable which contains all the dataframes, but I keep running into issues. I've read about using dictionaries to store all the dataframes, but I'm not sure this is viable, as some of my dataframes may contain the same numbers and I don't want them to be erroneously excluded as duplicates. I'm really new to Python and pandas, so pardon my ignorance, please!
An example of my starting data frames:
0 Trace Length
1 domain1 10
2 length1 1
3 length2 2
4 length3 3
5 width1 4
6 width2 5
7 width3 6
What it should look like after my formatting function:
0 Trace Length new_trace new_length
1 domain1 10 NaN NaN
2 length1 1 width1 4
3 length2 2 width2 5
4 length3 3 width3 6
The code I've written thus far:
import pandas as pd
import openpyxl

df1 = pd.read_csv(Path...1)
df2 = pd.read_csv(Path...2)
df3 = pd.read_csv(Path...3)
alldfs = (df1, df2, df3)

def format(df):
    result = df['Trace'].str.extract(r'(\d+)', expand=False)
    mask = df['Trace'].str.contains('width')
    df = (df.loc[~mask]
            .join(df.loc[mask], rsuffix='_new')
            .rename(lambda x: '_'.join(x.split('_')[::-1]), axis=1)
            .reset_index(drop=True))
    df.to_excel(f"new_{df}.xlsx")
    # How do I code this last part of the function so that it dynamically
    # exports the current indexed df in the alldfs variable?

for df in alldfs:
    format(df)
Any help figuring this out would be so helpful, I've been at it all day, and haven't made much progress. Thanks!
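One sketch of the export step, assuming the only real requirement is a unique label per frame: keep the frames in a dict keyed by a name of your choosing (dict keys must be unique, so identical frame contents can never collide), and derive the filename from the key rather than from the frame itself. The names and small frames below are hypothetical stand-ins:
import pandas as pd

# hypothetical stand-ins for the ~500 frames read from disk
df1 = pd.DataFrame({'Trace': ['domain1', 'length1', 'width1'],
                    'Length': [10, 1, 4]})
df2 = pd.DataFrame({'Trace': ['domain2', 'length1', 'width1'],
                    'Length': [20, 2, 5]})

alldfs = {'trace1': df1, 'trace2': df2}  # label -> frame

def format_df(df):
    # placeholder for the reshaping logic already written above
    return df.reset_index(drop=True)

for name, df in alldfs.items():
    # the dict key, not the frame itself, drives the output filename
    format_df(df).to_excel(f'new_{name}.xlsx', index=False)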
I am working with a large dataset with a column for reviews which consists of comma-separated strings, for example "A,B,C", "A,B*,B", etc.
A small example:
import pandas as pd

df = pd.DataFrame({'cat1': [1, 2, 3, 4, 5],
                   'review': ['A,B,C', 'A,B*,B,C', 'A,C', 'A,B,C,D', 'A,B,C,A,B']})
df2 = df["review"].str.split(",", expand=True)
df.join(df2)
I want to split that column up into separate columns for each letter, then add those columns into the original data frame. I used df2 = df["review"].str.split(",",expand = True) and df.join(df2) to do that.
However, when I use df["A"].unique() there are entries that should not be in the column. I only want 'A' to appear there, but B and C show up as well. Also, B and B* are not splitting into two columns.
My dataset is quite large so I don't know how to properly illustrate the problem; I have tried to provide a small-scale example, but everything seems to work correctly in it.
I have tried to look through the original column with df['review'].unique() and all entries were entered correctly (no missing commas or anything like that), so I was wondering if there is something wrong with my approach that would influence it to not work correctly across all datasets. Or is there something wrong with my dataset.
Does anyone have any suggestions as to how I should troubleshoot?
when i use df["A"].unique() there are entries that should not be in the column. I only want 'A' to appear there
IIUC, you wanted to create dummy variables instead? (Note that str.split(',', expand=True) produces positionally numbered columns 0, 1, 2, ..., so the same letter can land in a different column from row to row; get_dummies keys on the value itself.)
# get_dummies yields 0/1 indicator columns; multiplying each column by
# its own name turns 1 into the name and 0 into '' (int * str repetition),
# and the empty strings are then replaced with NaN
df2 = df.join(df['review'].str.get_dummies(sep=',')
                .pipe(lambda x: x * [*x])
                .replace('', float('nan')))
Output:
cat1 review A B B* C D
0 1 A,B,C A B NaN C NaN
1 2 A,B*,B,C A B B* C NaN
2 3 A,C A NaN NaN C NaN
3 4 A,B,C,D A B NaN C D
4 5 A,B,C,A,B A B NaN C NaN
I have a csv-file which looks like this:
A B C
1 2 3 4 5 6 7
8 9 1 2 3 4 5
When I read in this file using this code:
df2 = pd.read_csv(r'path\to\file.csv', delimiter=';')
I get a pandas dataframe which consists of three columns named A, B and C.
The first five columns of my actual csv file are taken as the index, the last two end up in columns A and B, and column C contains only NaN values.
Instead I would like to get a dataframe with columns A, B and C as the first three columns and the rest as Unnamed columns. I think it is maybe a formatting issue with my csv-file but I do not know how to solve this..
Thanks a lot!
Try this:
df2 = pd.read_csv(r'path\to\file.csv', delimiter=' ', names=['A','B','C','D','E','F','G'], skiprows=1, index_col=False)
Passing names supplies one label per data column (so none get consumed as the index), skiprows=1 skips the original three-name header line, and index_col=False prevents the first column from being used as the index.
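If you would rather keep only A, B and C as real names and leave the rest unnamed, one variant sketch (still assuming a space-delimited file) is to read without a header and assign the labels afterwards:
df2 = pd.read_csv(r'path\to\file.csv', delimiter=' ', header=None, skiprows=1, index_col=False)
# first three columns get real names, the rest get Unnamed placeholders
df2.columns = ['A', 'B', 'C'] + [f'Unnamed: {i}' for i in range(3, df2.shape[1])]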
I have a dataframe with multiple columns and most have special characters like $, % or ^ and so on... How can I delete these characters throughout the entire data frame? I only know how to delete by column, for example:
df['Column'] = df['Column'].str.replace(r'^\d+', '', regex=True)
Note that pandas.DataFrame.replace treats characters like $ and ^ as regex metacharacters when regex=True, so they do not match literally unless escaped. You can use the following snippet to get rid of these special characters across the whole dataframe. We need to make sure a column is actually of string type before applying str.replace:
import pandas as pd
from pandas.api.types import is_string_dtype

f = pd.DataFrame({'A': [1, 2, 3],
                  'B': [4, 5, 6],
                  'C': ['f;', 'd:', 'sda$sd'],
                  'D': ['s%', 'd;', 'd^p'],
                  'E': [5, 3, 6],
                  'F': [7, 4, 3]})
f looks as follows:
A B C D E F
0 1 4 f; s% 5 7
1 2 5 d: d; 3 4
2 3 6 sda$sd d^p 6 3
Now use a loop to replace the strings.
for col in f.columns:
    if is_string_dtype(f[col]):
        # keep letters, digits, hyphens and whitespace; drop everything else
        f[col] = f[col].str.replace(r'[^A-Za-z0-9\s-]+', '', regex=True)
Output:
A B C D E F
0 1 4 f s 5 7
1 2 5 d d 3 4
2 3 6 sdasd dp 6 3
UPDATE:
Pandas version 0.24.1 does not replace some of these special characters, but versions 0.23.4 and 0.25.1 do. So if you have one of the working versions, you can use pandas.DataFrame.replace to delete the special characters as follows. Make sure to escape them with \ (or use raw strings):
f = f.replace({r'\$': '', r'\^': '', '%': ''}, regex=True)
This results in the same output as above.
I think you want:
pandas.DataFrame.replace(to_replace, value)
The to_replace parameter accepts regex patterns (with regex=True), and the method covers the whole dataframe. Hope this helps.
Here's the documentation on this method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html#pandas.DataFrame.replace
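For instance, a small sketch reusing two columns from the example above: $ and ^ are regex metacharacters, so match them through a character class (or escape them):
import pandas as pd

f = pd.DataFrame({'C': ['f;', 'd:', 'sda$sd'],
                  'D': ['s%', 'd;', 'd^p']})
# to_replace is interpreted as a regex when regex=True; [$%^] matches
# the literal characters without individual escaping
f = f.replace(r'[$%^]', '', regex=True)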
I have a dataframe in pandas with four columns. The data consists of strings. Sample:
A B C D
0 2 asicdsada v:cVccv u
1 4 ascccaiiidncll v:cVccv:ccvc u
2 9 sca V:c u
3 11 lkss v:cv u
4 13 lcoao v:ccv u
5 14 wuduakkk V:ccvcv: u
I want to replace the string 'u' in Col D with the string 'a' if Col C in that row contains the substring 'V' (case sensitive).
Desired outcome:
A B C D
0 2 asicdsada v:cVccv a
1 4 ascccaiiidncll v:cVccv:ccvc a
2 9 sca V:c a
3 11 lkss v:cv u
4 13 lcoao v:ccv u
5 14 wuduakkk V:ccvcv: a
I prefer to overwrite the value already in Column D, rather than assign two different values, because I'd like to selectively overwrite some of these values again later, under different conditions.
It seems like this should have a simple solution, but I cannot figure it out, and haven't been able to find a fully applicable solution in other answered questions.
df.ix[1]["D"] = "a"
changes an individual value.
df.ix[:]["C"].str.contains("V")
returns a series of booleans, but I am not sure what to do with it. I have tried many combinations of .loc, apply, contains, re.search, and for loops, and I either get errors or replace every value in column D. I'm a novice with pandas/python, so it's hard to know whether my syntax, methods, or conceptualization of what I even need to do are off (probably all of the above).
As you've already tried, use str.contains to get a boolean Series, and then use .loc to say "change these rows and the D column". For example:
In [5]: df.loc[df["C"].str.contains("V"), "D"] = "a"
In [6]: df
Out[6]:
A B C D
0 2 asicdsada v:cVccv a
1 4 ascccaiiidncll v:cVccv:ccvc a
2 9 sca V:c a
3 11 lkss v:cv u
4 13 lcoao v:ccv u
5 14 wuduakkk V:ccvcv: a
(Avoid using .ix -- it was deprecated and has since been removed from pandas; use .loc or .iloc instead.)
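A related note: since the plan is to overwrite column D again later under different conditions, Series.mask chains naturally. A sketch on the same frame:
# wherever C contains 'V', replace the value in D with 'a';
# repeat with other conditions and values as needed
df['D'] = df['D'].mask(df['C'].str.contains('V'), 'a')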