Spliting a DataFrame column to columns with regex?

Spliting a DataFrame column to columns with regex?

I have a DataFrame with two columns the second one has following format :

1 continuity 1.0000e+00 x-velocity 0.0000e+00 y-velocity 4.4010e-02 z-velocity 9.5681e-04 energy 1.1549e-07 2 continuity 1.0000e+00 x-velocity 7.8788e-04 y-velocity 1.2617e+01 z-velocity 9.0445e-04 energy 4.5605e-06 3 continuity 2.3250e-01 x-velocity 1.6896e-03 y-velocity 1.2536e-02 z-velocity 9.8980e-03 energy 3.4032e-06 4 continuity 8.0243e-02 x-velocity 2.2180e-03 y-velocity 1.3189e-02 z-velocity 1.0225e-02 energy 4.6336e-07 5 {continuity 7.0923e-02 x-velocity 2.2674e-03 y-velocity 1.2308e-02

And I'm trying to use regex to split it into columns, by getting the first number, then getting all the numbers in between the brackets "" and give them the following names:

names=['iter', 'x', 'x-vel', 'y-vel', 'z-vel', 'energy']

However I just don't seem to make the regular expression work, here's what i'm doing in a simple example:

Input

>>> a = "1 continuity 1.0000e+00 x-velocity 0.0000e+00 y-velocity 4.4010e-02 z-velocity 9.5681e-04 energy 1.1549e-07 " >>> re.findall("(d*) {*D*(.*?)", a)

Result

>>> [('1', '1.0000e+00'), ('', '0.0000e+00'), ('', '4.4010e-02'), ('', '9.5681e-04'), ('', '1.1549e-07')]

As you can see my regex keeps looking for a number for every occurrence, but I don't want that to happen, how to do so?

Expected Behavior

>>> [('1'), ('1.0000e+00'), ('0.0000e+00'), ('4.4010e-02'), ('9.5681e-04'), ('1.1549e-07')]

When my regular expression works, I'm trying to assign all the columns with a line that would look something like this:

df[names] = df.first.str.extract(r'(d*) {*D*(.*?)', expand=True)

I'm really new to dataframes, is this the correct approach for this problem?

Any help would be much appreciated, thanks in advance!

1 Answer
1

First, let's make a series from some data in the question.

import pandas as pd data = pd.Series(''' 1 continuity 1.0000e+00 x-velocity 0.0000e+00 y-velocity 4.4010e-02 z-velocity 9.5681e-04 energy 1.1549e-07 2 continuity 1.0000e+00 x-velocity 7.8788e-04 y-velocity 1.2617e+01 z-velocity 9.0445e-04 energy 4.5605e-06 3 continuity 2.3250e-01 x-velocity 1.6896e-03 y-velocity 1.2536e-02 z-velocity 9.8980e-03 energy 3.4032e-06 4 continuity 8.0243e-02 x-velocity 2.2180e-03 y-velocity 1.3189e-02 z-velocity 1.0225e-02 energy 4.6336e-07 ''' .split('n')) print(data)

0 1 {continuity 1.0000e+00 {x-velocity 0.0000e... 1 2 {continuity 1.0000e+00 x-velocity 7.8788e... 2 3 continuity 2.3250e-01 x-velocity 1.6896e... 3 4 continuity 8.0243e-02 x-velocity 2.2180e... dtype: object 0

The first option is a simple regex to find all numbers in order. Use extractall to find every match in each string. This may be good enough. You still have to name the columns, which isn't hard. This will be have a MultiIndex (which is a little more advanced), since each match could have multiple groups (but this regex only has one group), hence the need to .unstack() it.

extractall

MultiIndex

.unstack()

print(data.str.extractall(r'(d[d.e+-]*)').unstack())

match 0 1 2 3 4 5 0 1 1.0000e+00 0.0000e+00 4.4010e-02 9.5681e-04 1.1549e-07 1 2 1.0000e+00 7.8788e-04 1.2617e+01 9.0445e-04 4.5605e-06 2 3 2.3250e-01 1.6896e-03 1.2536e-02 9.8980e-03 3.4032e-06 3 4 8.0243e-02 2.2180e-03 1.3189e-02 1.0225e-02 4.6336e-07

Alternatively, you can use a named groups regex. This is a fairly straightforward template to build from one of the strings. This will put the names from the regex into the columns. But the names must be valid Python identifiers. (x_vel, not x-vel). But this is probably what you want anyway, since it lets you access the columns as attrs (like df.x_vel instead of df['x-vel']). This (?P<foo>...) named group syntax is all explained in the re module docs.

x_vel

x-vel

df.x_vel

df['x-vel']

(?P<foo>...)

re

print( data.str.extract(r'(?P<iter>d+) continuity (?P<x>[^]+) x-velocity (?P<x_vel>[^]+) y-velocity (?P<y_vel>[^]+) z-velocity (?P<z_vel>[^]+) energy (?P<energy>[^]+) ', expand=False))

iter x x_vel y_vel z_vel energy 0 1 1.0000e+00 0.0000e+00 4.4010e-02 9.5681e-04 1.1549e-07 1 2 1.0000e+00 7.8788e-04 1.2617e+01 9.0445e-04 4.5605e-06 2 3 2.3250e-01 1.6896e-03 1.2536e-02 9.8980e-03 3.4032e-06 3 4 8.0243e-02 2.2180e-03 1.3189e-02 1.0225e-02 4.6336e-07

Note that we're using extract instead of extractall here because there are multiple groups in the regex itself.

extract

extractall

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Search This Blog

YTjnti